DPDK patches and discussions
* [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
@ 2020-05-27 17:02 Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
                   ` (7 more replies)
  0 siblings, 8 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, liang.j.ma

This patchset proposes a simple API for Ethernet drivers
to cause the CPU to enter a power-optimized state while
waiting for packets to arrive, along with a set of
(hopefully generic) intrinsics that facilitate that. This
is achieved through cooperation with the NIC driver that
will allow us to know the address of the next NIC RX ring
packet descriptor, and wait for writes on it.

On IA, this is achieved through using UMONITOR/UMWAIT
instructions. They are used in their raw opcode form
because there is no widespread compiler support for
them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen
to implement similar instructions.

To achieve power savings, there is a very simple mechanism
used: we're counting empty polls, and if a certain threshold
is reached, we get the address of next RX ring descriptor
from the NIC driver, arm the monitoring hardware, and
enter a power-optimized state. We will then wake up when
either a timeout happens, or a write happens (or generally
whenever CPU feels like waking up - this is platform-
specific), and proceed as normal. The empty poll counter is
reset whenever we actually get packets, so we only go to
sleep when we know nothing is going on.
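
The counting logic described above can be sketched in isolation as
follows. This is only an illustration, not the code from the patches:
the names `empty_poll_state`, `should_sleep` and `EMPTY_POLL_THRESHOLD`
are hypothetical, and the threshold value simply mirrors the
ETH_EMPTYPOLL_MAX constant introduced in patch 2/6.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; value mirrors ETH_EMPTYPOLL_MAX from patch 2/6. */
#define EMPTY_POLL_THRESHOLD 512

/* Hypothetical per-queue state: consecutive polls that returned nothing. */
struct empty_poll_state {
	uint64_t empty_polls;
};

/*
 * Update the counter after one RX poll and report whether the caller
 * should arm the monitoring hardware and enter the optimized state.
 */
static inline bool
should_sleep(struct empty_poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* packets arrived: reset the counter and stay awake */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

A caller would invoke `should_sleep()` once per RX burst and only then
ask the driver for the next descriptor address.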

Why are we putting it into ethdev as opposed to leaving
this up to the application? Our customers specifically
requested a way to do it with minimal changes to the
application code. The current approach allows them to just
flip a switch and automagically have power savings.

There are certain limitations in this patchset right now:
- Currently, only 1:1 core to queue mapping is supported,
  meaning that each lcore must at most handle RX on a
  single queue
- Currently, power management is enabled per-port, not
  per-queue
- There is potential to greatly increase TX latency if we
  are buffering things, and go to sleep before sending
  packets
- The API is not perfect and could use some improvement
  and discussion
- The API doesn't extend to other device types
- The intrinsics are platform-specific, so ethdev has
  some platform-specific code in it
- Support was only implemented for devices using
  net/ixgbe, net/i40e and net/ice drivers

Hopefully this would generate enough feedback to clear
a path forward!

Anatoly Burakov (6):
  eal: add power management intrinsics
  ethdev: add simple power management API
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  app/testpmd: add command for power management on a port

 app/test-pmd/cmdline.c                        |  48 +++++++
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  23 +++
 drivers/net/i40e/i40e_rxtx.h                  |   2 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  23 +++
 drivers/net/ice/ice_rxtx.h                    |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  22 +++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 lib/librte_ethdev/rte_ethdev.c                |  39 +++++
 lib/librte_ethdev/rte_ethdev.h                |  70 +++++++++
 lib/librte_ethdev/rte_ethdev_core.h           |  41 +++++-
 lib/librte_ethdev/rte_ethdev_version.map      |   4 +
 20 files changed, 480 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

-- 
2.17.1


* [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-28 11:39   ` Ananyev, Konstantin
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Konstantin Ananyev, david.hunt, liang.j.ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management intrinsics provide architecture-specific
functions to either wait until a specified TSC timestamp is reached, or
wait until either a TSC timestamp is reached or a memory location is
written to. The monitor function also provides an optional comparison,
to avoid sleeping when the expected write has already happened and no
more writes are expected.
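
For illustration, the optional comparison amounts to a masked read of
the monitored location after the monitor has been armed. The helper
below is a hypothetical stand-alone sketch (the name
`write_already_happened` does not exist in the patch); it only shows
the check itself, not the UMONITOR/UMWAIT sequence around it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if the masked value at `p` already
 * equals `expected_value`, in which case sleeping would be pointless.
 * A zero mask disables the check, as in the proposed API.
 */
static inline bool
write_already_happened(const volatile uint64_t *p,
		uint64_t expected_value, uint64_t value_mask)
{
	if (value_mask == 0)
		return false;
	return (*p & value_mask) == expected_value;
}
```

The check has to come after the address is armed, otherwise a write
landing between the read and the arming would go unnoticed.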

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 203 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..8646c4ac16
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index bc73ec2c5c..b54a2be4f6 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -59,6 +59,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..94d6a43763 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
 	RTE_CPUFLAG_EM64T,                  /**< EM64T */
 
+	RTE_CPUFLAG_WAITPKG,                 /**< UMONITOR/UMWAIT/TPAUSE */
 	/* (EAX 80000007h) EDX features */
 	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
 
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a0522400fb
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+	rte_mb();
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


* [dpdk-dev] [RFC 2/6] ethdev: add simple power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-28 12:15   ` Ananyev, Konstantin
  2020-05-27 17:02 ` [dpdk-dev] [RFC 3/6] net/ixgbe: implement " Anatoly Burakov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev
  Cc: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella,
	Neil Horman, david.hunt, liang.j.ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait until
either a TSC timestamp expires or packets arrive.

This API is limited to 1 core 1 queue use case as there is no
coordination between queues/cores in ethdev.

The TSC timestamp is automatically calculated using current link
speed and RX descriptor ring size, such that the sleep time is
not longer than it would take for a NIC to fill its entire RX
descriptor ring.
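
As a worked example of such a bound, one possible way to derive it is
sketched below. This is an assumption for illustration only: the helper
name, the choice of minimum-size frames (84 bytes on the wire including
preamble and inter-frame gap) and the parameters are hypothetical, and
the code in this RFC passes the maximum TSC value to
rte_power_monitor() rather than a computed one.

```c
#include <stdint.h>

/* 64-byte frame + 8-byte preamble + 12-byte inter-frame gap, in bits. */
#define MIN_FRAME_WIRE_BITS (84ULL * 8ULL)

/*
 * Hypothetical helper: number of TSC cycles a NIC needs to fill a ring
 * of `nb_rx_desc` descriptors with minimum-size frames at line rate.
 * Returns 0 if the link speed is unknown.
 */
static inline uint64_t
ring_fill_tsc_cycles(uint16_t nb_rx_desc, uint32_t link_speed_mbps,
		uint64_t tsc_hz)
{
	if (link_speed_mbps == 0)
		return 0;

	const uint64_t bits = (uint64_t)nb_rx_desc * MIN_FRAME_WIRE_BITS;
	const uint64_t link_bps = (uint64_t)link_speed_mbps * 1000000ULL;

	/* cycles = seconds to fill the ring * TSC frequency */
	return bits * tsc_hz / link_bps;
}
```

The deadline would then be the current TSC value plus this cycle count.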

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.c           | 39 +++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 70 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h      | 41 +++++++++++++-
 lib/librte_ethdev/rte_ethdev_version.map |  4 ++
 4 files changed, 152 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 8e10a6fc36..0be5ecfc11 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -16,6 +16,7 @@
 #include <netinet/in.h>
 
 #include <rte_byteorder.h>
+#include <rte_cpuflags.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_interrupts.h>
@@ -5053,6 +5054,44 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
 	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
 }
 
+int
+rte_eth_dev_power_mgmt_enable(uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+		return -ENOTSUP;
+
+	/* allocate memory for empty poll stats */
+	dev->empty_poll_stats = rte_malloc_socket(NULL,
+		sizeof(struct rte_eth_ep_stat) * RTE_MAX_QUEUES_PER_PORT,
+		0, dev->data->numa_node);
+
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_eth_dev_power_mgmt_disable(uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* rte_free ignores NULL so safe to call without checks */
+	rte_free(dev->empty_poll_stats);
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	return 0;
+}
+
 /**
  * A set of values to describe the possible states of a switch domain.
  */
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index a49242bcd2..b8318f7e91 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -666,6 +667,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1490,6 +1492,16 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
@@ -4302,6 +4314,38 @@ __rte_experimental
 int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
 				       struct rte_eth_hairpin_cap *cap);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_enable(uint16_t port_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_disable(uint16_t port_id);
+
 #include <rte_ethdev_core.h>
 
 /**
@@ -4417,6 +4461,32 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 		} while (cb != NULL);
 	}
 #endif
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[queue_id].num++;
+			if (unlikely(dev->empty_poll_stats[queue_id].num >
+					ETH_EMPTYPOLL_MAX)) {
+				volatile void *target_addr;
+				uint64_t expected, mask;
+				int ret;
+
+				/*
+				 * get address of next descriptor in the RX
+				 * ring for this queue, as well as expected
+				 * value and a mask.
+				 */
+				ret = (*dev->dev_ops->next_rx_desc)
+					(dev->data->rx_queues[queue_id],
+					 &target_addr, &expected, &mask);
+				if (ret == 0)
+					/* -1ULL is maximum value for TSC */
+					rte_power_monitor(target_addr,
+							  expected, mask,
+							  0, -1ULL);
+			}
+		} else
+			dev->empty_poll_stats[queue_id].num = 0;
+	}
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
 	return nb_rx;
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..4e23d465f0 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to the variable that receives the descriptor address.
+ *
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +773,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +838,14 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/** Flag indicating the port power management state */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	uint32_t reserved_32;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/** Per-queue empty poll counters */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	void *reserved_ptrs[3];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 7155056045..141361823d 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -241,4 +241,8 @@ EXPERIMENTAL {
 	__rte_ethdev_trace_rx_burst;
 	__rte_ethdev_trace_tx_burst;
 	rte_flow_get_aged_flows;
+
+	# added in 20.08
+	rte_eth_dev_power_mgmt_disable;
+	rte_eth_dev_power_mgmt_enable;
 };
-- 
2.17.1


* [dpdk-dev] [RFC 3/6] net/ixgbe: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 4/6] net/i40e: " Anatoly Burakov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Wei Zhao, Jeff Guo, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return the address of the status
field of the next RX ring descriptor.
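
The shape of such a callback can be shown with a mock ring. Everything
below is hypothetical (`mock_rx_desc`, `mock_rx_queue`,
`MOCK_DESC_STAT_DD`, `mock_next_rx_desc`) and host byte order is
assumed; the real drivers use their own descriptor layouts and
little-endian conversions, as in the diffs of this series.

```c
#include <stdint.h>

/* Hypothetical "descriptor done" bit in the status word. */
#define MOCK_DESC_STAT_DD 0x1ULL

/* Hypothetical descriptor: buffer address plus a write-back status. */
struct mock_rx_desc {
	uint64_t addr;
	uint64_t status;
};

/* Hypothetical queue: descriptor ring and index of the next descriptor. */
struct mock_rx_queue {
	volatile struct mock_rx_desc *ring;
	uint16_t rx_tail;
};

/*
 * Report the address the NIC will write next, together with the value
 * and mask that indicate the write has already taken place.
 */
static inline int
mock_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask)
{
	struct mock_rx_queue *rxq = rx_queue;

	/* watch the status word of the descriptor at the tail */
	*tail_desc_addr = &rxq->ring[rxq->rx_tail].status;

	/* DD set means the descriptor has already been written back */
	*expected = MOCK_DESC_STAT_DD;
	*mask = MOCK_DESC_STAT_DD;

	return 0;
}
```

The three outputs map directly onto the `p`, `expected_value` and
`value_mask` arguments of rte_power_monitor().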

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index a4e5c539de..190d11d98d 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -605,6 +605,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 2e20e18c7a..ef2fb5fca9 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 20a8b291d4..6c35966c78 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


* [dpdk-dev] [RFC 4/6] net/i40e: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (2 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 3/6] net/ixgbe: implement " Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 5/6] net/ice: " Anatoly Burakov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Beilei Xing, Jeff Guo, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return the address of the status
field of the next RX ring descriptor.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 749d85f544..f3ce54911b 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -526,6 +526,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5e7c86ed82..76dfbb2098 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 8f11f011a7..72d810475b 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -245,6 +245,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


* [dpdk-dev] [RFC 5/6] net/ice: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (3 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 4/6] net/i40e: " Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port Anatoly Burakov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Qiming Yang, Wenzhuo Lu, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return the address of the status
field of the next RX ring descriptor.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d5110c4392..db8269a548 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -219,6 +219,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 1c9f31efdf..80fd6bd134 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (4 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 5/6] net/ice: " Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  7 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu, Beilei Xing, Bernard Iremonger, david.hunt, liang.j.ma

A quick-and-dirty testpmd command to enable power management on
a specific port.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 app/test-pmd/cmdline.c | 48 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 996a498768..e3a5e19485 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -1773,6 +1773,53 @@ cmdline_parse_inst_t cmd_config_speed_specific = {
 	},
 };
 
+/* *** enable power management for specific port *** */
+struct cmd_port_pmgmt {
+	cmdline_fixed_string_t port;
+	portid_t id;
+	cmdline_fixed_string_t pmgmt;
+	cmdline_fixed_string_t on;
+};
+
+static void
+cmd_port_pmgmt_parsed(void *parsed_result,
+				__rte_unused struct cmdline *cl,
+				__rte_unused void *data)
+{
+	struct cmd_port_pmgmt *res = parsed_result;
+
+	if (port_id_is_invalid(res->id, ENABLED_WARN))
+		return;
+
+	if (!strcmp(res->on, "on"))
+		rte_eth_dev_power_mgmt_enable(res->id);
+	else if (!strcmp(res->on, "off"))
+		rte_eth_dev_power_mgmt_disable(res->id);
+}
+
+
+cmdline_parse_token_string_t cmd_port_pmgmt_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, port, "port");
+cmdline_parse_token_num_t cmd_port_pmgmt_id =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_pmgmt, id, UINT16);
+cmdline_parse_token_string_t cmd_port_pmgmt_item1 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, pmgmt, "power-mgmt");
+cmdline_parse_token_string_t cmd_port_pmgmt_value1 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, on, "on#off");
+
+cmdline_parse_inst_t cmd_port_pmgmt = {
+	.f = cmd_port_pmgmt_parsed,
+	.data = NULL,
+	.help_str = "port <port_id> power-mgmt on|off",
+	.tokens = {
+		(void *)&cmd_port_pmgmt_port,
+		(void *)&cmd_port_pmgmt_id,
+		(void *)&cmd_port_pmgmt_item1,
+		(void *)&cmd_port_pmgmt_value1,
+		NULL,
+	},
+};
+
 /* *** configure loopback for all ports *** */
 struct cmd_config_loopback_all {
 	cmdline_fixed_string_t port;
@@ -19692,6 +19739,7 @@ cmdline_parse_ctx_t main_ctx[] = {
 	(cmdline_parse_inst_t *)&cmd_show_set_raw,
 	(cmdline_parse_inst_t *)&cmd_show_set_raw_all,
 	(cmdline_parse_inst_t *)&cmd_config_tx_dynf_specific,
+	(cmdline_parse_inst_t *)&cmd_port_pmgmt,
 	NULL,
 };
 
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (5 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port Anatoly Burakov
@ 2020-05-27 17:33 ` Jerin Jacob
  2020-05-27 20:57   ` Stephen Hemminger
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  7 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-05-27 17:33 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dpdk-dev, David Hunt, Liang Ma

On Wed, May 27, 2020 at 10:32 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers
> to cause the CPU to enter a power-optimized state while
> waiting for packets to arrive, along with a set of
> (hopefully generic) intrinsics that facilitate that. This
> is achieved through cooperation with the NIC driver that
> will allow us to know address of the next NIC RX ring
> packet descriptor, and wait for writes on it.
>
> On IA, this is achieved through using UMONITOR/UMWAIT
> instructions. They are used in their raw opcode form
> because there is no widespread compiler support for
> them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen
> to implement similar instructions.
>
> To achieve power savings, there is a very simple mechanism
> used: we're counting empty polls, and if a certain threshold
> is reached, we get the address of next RX ring descriptor
> from the NIC driver, arm the monitoring hardware, and
> enter a power-optimized state. We will then wake up when
> either a timeout happens, or a write happens (or generally
> whenever CPU feels like waking up - this is platform-
> specific), and proceed as normal. The empty poll counter is
> reset whenever we actually get packets, so we only go to
> sleep when we know nothing is going on.
>
> Why are we putting it into ethdev as opposed to leaving
> this up to the application? Our customers specifically
> requested a way to do it with minimal changes to the
> application code. The current approach allows to just
> flip a switch and automagically have power savings.
>
> There are certain limitations in this patchset right now:
> - Currently, only 1:1 core to queue mapping is supported,
>   meaning that each lcore must at most handle RX on a
>   single queue
> - Currently, power management is enabled per-port, not
>   per-queue
> - There is potential to greatly increase TX latency if we
>   are buffering things, and go to sleep before sending
>   packets
> - The API is not perfect and could use some improvement
>   and discussion
> - The API doesn't extend to other device types
> - The intrinsics are platform-specific, so ethdev has
>   some platform-specific code in it
> - Support was only implemented for devices using
>   net/ixgbe, net/i40e and net/ice drivers
>
> Hopefully this would generate enough feedback to clear
> a path forward!

Just for my understanding:

How is this solution superior to the Rx queue interrupt based scheme that
is used in l3fwd-power?

What I mean by superior here is, for example:
a) Is there any power saving, in milliwatts, vs. the interrupt scheme?
b) Is there any reduction in the time taken to switch from/to a
different state
(i.e. how fast it can move from a low power state to the full power state) vs.
the interrupt scheme?
etc.

Or is this just about pushing all the logic into ethdev so that
applications can be transparent?


>
> Anatoly Burakov (6):
>   eal: add power management intrinsics
>   ethdev: add simple power management API
>   net/ixgbe: implement power management API
>   net/i40e: implement power management API
>   net/ice: implement power management API
>   app/testpmd: add command for power management on a port
>
>  app/test-pmd/cmdline.c                        |  48 +++++++
>  drivers/net/i40e/i40e_ethdev.c                |   1 +
>  drivers/net/i40e/i40e_rxtx.c                  |  23 +++
>  drivers/net/i40e/i40e_rxtx.h                  |   2 +
>  drivers/net/ice/ice_ethdev.c                  |   1 +
>  drivers/net/ice/ice_rxtx.c                    |  23 +++
>  drivers/net/ice/ice_rxtx.h                    |   2 +
>  drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c                |  22 +++
>  drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
>  .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
>  lib/librte_eal/x86/rte_cpuflags.c             |   2 +
>  lib/librte_ethdev/rte_ethdev.c                |  39 +++++
>  lib/librte_ethdev/rte_ethdev.h                |  70 +++++++++
>  lib/librte_ethdev/rte_ethdev_core.h           |  41 +++++-
>  lib/librte_ethdev/rte_ethdev_version.map      |   4 +
>  20 files changed, 480 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
>
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
@ 2020-05-27 20:57   ` Stephen Hemminger
  0 siblings, 0 replies; 421+ messages in thread
From: Stephen Hemminger @ 2020-05-27 20:57 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Anatoly Burakov, dpdk-dev, David Hunt, Liang Ma

On Wed, 27 May 2020 23:03:59 +0530
Jerin Jacob <jerinjacobk@gmail.com> wrote:

> On Wed, May 27, 2020 at 10:32 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
> >
> > This patchset proposes a simple API for Ethernet drivers
> > to cause the CPU to enter a power-optimized state while
> > waiting for packets to arrive, along with a set of
> > (hopefully generic) intrinsics that facilitate that. This
> > is achieved through cooperation with the NIC driver that
> > will allow us to know address of the next NIC RX ring
> > packet descriptor, and wait for writes on it.
> >
> > On IA, this is achieved through using UMONITOR/UMWAIT
> > instructions. They are used in their raw opcode form
> > because there is no widespread compiler support for
> > them yet. Still, the API is made generic enough to
> > hopefully support other architectures, if they happen
> > to implement similar instructions.
> >
> > To achieve power savings, there is a very simple mechanism
> > used: we're counting empty polls, and if a certain threshold
> > is reached, we get the address of next RX ring descriptor
> > from the NIC driver, arm the monitoring hardware, and
> > enter a power-optimized state. We will then wake up when
> > either a timeout happens, or a write happens (or generally
> > whenever CPU feels like waking up - this is platform-
> > specific), and proceed as normal. The empty poll counter is
> > reset whenever we actually get packets, so we only go to
> > sleep when we know nothing is going on.
> >
> > Why are we putting it into ethdev as opposed to leaving
> > this up to the application? Our customers specifically
> > requested a way to do it with minimal changes to the
> > application code. The current approach allows to just
> > flip a switch and automagically have power savings.
> >
> > There are certain limitations in this patchset right now:
> > - Currently, only 1:1 core to queue mapping is supported,
> >   meaning that each lcore must at most handle RX on a
> >   single queue
> > - Currently, power management is enabled per-port, not
> >   per-queue
> > - There is potential to greatly increase TX latency if we
> >   are buffering things, and go to sleep before sending
> >   packets
> > - The API is not perfect and could use some improvement
> >   and discussion
> > - The API doesn't extend to other device types
> > - The intrinsics are platform-specific, so ethdev has
> >   some platform-specific code in it
> > - Support was only implemented for devices using
> >   net/ixgbe, net/i40e and net/ice drivers
> >
> > Hopefully this would generate enough feedback to clear
> > a path forward!  
> 
> Just for my understanding:
> 
> How is this solution superior to the Rx queue interrupt based scheme that
> is used in l3fwd-power?
> 
> What I mean by superior here is, for example:
> a) Is there any power saving, in milliwatts, vs. the interrupt scheme?
> b) Is there any reduction in the time taken to switch from/to a
> different state
> (i.e. how fast it can move from a low power state to the full power state) vs.
> the interrupt scheme?
> etc.
> 
> Or is this just about pushing all the logic into ethdev so that
> applications can be transparent?
> 

The interrupt scheme is going to get better power savings, since
the core can go to WAIT. This scheme does look interesting in theory,
since it will have lower latency.

But it has a number of issues:
  * it requires changing drivers
  * it cannot multiplex multiple queues per core; you are assuming
    a certain threading model
  * what if the thread is preempted?
  * what about a thread in a VM?
  * it is platform specific: ARM and x86 have different semantics here


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
@ 2020-05-28 11:39   ` Ananyev, Konstantin
  2020-05-28 14:40     ` Burakov, Anatoly
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 11:39 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

Hi Anatoly,

> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.

Recently the ARM guys introduced a new generic API
for similar (as I understand it) purposes: rte_wait_until_equal_(16|32|64).
It would probably make sense to unite both APIs into something common
and HW transparent.
Konstantin
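
To make the idea of a common API a bit more concrete, below is a minimal
sketch of a masked "wait until equal" primitive with a plain polling
fallback. The function name, the max_spins parameter and the return
convention are all hypothetical (nothing here is taken from rte_pause.h or
from this patch); the only thing it mirrors is the
(value & mask) == expected check used by rte_power_monitor() in the patch.
An architecture backend would replace the loop body with its own wait
instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not an existing DPDK function: poll *addr until
 * (*addr & mask) == expected, or until max_spins polls have been made.
 * Returns true if the masked value matched, false if we gave up.
 */
static inline bool
wait_until_masked_equal_64(const volatile uint64_t *addr, uint64_t expected,
		uint64_t mask, uint64_t max_spins)
{
	uint64_t i;

	assert(addr != NULL);

	for (i = 0; i < max_spins; i++) {
		if ((*addr & mask) == expected)
			return true;
		/*
		 * Generic fallback: just poll again. An x86 or ARM backend
		 * would arm its monitor and sleep here instead.
		 */
	}
	return false;
}
```

With max_spins set to 0 the helper returns false without reading the
address, so a caller that wants at least one check has to pass 1 or more.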

> 
> Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
>  lib/librte_eal/x86/rte_cpuflags.c             |   2 +
>  6 files changed, 203 insertions(+)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> 
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..8646c4ac16
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file defines APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index bc73ec2c5c..b54a2be4f6 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -59,6 +59,7 @@ generic_headers = files(
>  	'generic/rte_memcpy.h',
>  	'generic/rte_pause.h',
>  	'generic/rte_prefetch.h',
> +	'generic/rte_power_intrinsics.h',
>  	'generic/rte_rwlock.h',
>  	'generic/rte_spinlock.h',
>  	'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_prefetch.h',
> +	'rte_power_intrinsics.h',
>  	'rte_pause.h',
>  	'rte_rtm.h',
>  	'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..94d6a43763 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
>  	RTE_CPUFLAG_EM64T,                  /**< EM64T */
> 
> +	RTE_CPUFLAG_WAITPKG,                 /**< UMONITOR/UMWAIT/TPAUSE */
>  	/* (EAX 80000007h) EDX features */
>  	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
> 
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..a0522400fb
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	uint64_t rflags;
> +
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +	rte_mb();
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	uint64_t rflags;
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 30439e7951..0325c4b93b 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
>  	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>  	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
> 
> +	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
> +
>  	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>  	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
> 
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 2/6] ethdev: add simple power management API
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
@ 2020-05-28 12:15   ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 12:15 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko, Ray Kinsella,
	Neil Horman, Hunt, David, Ma, Liang J

> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 queue use case as there is no
> coordination between queues/cores in ethdev.
> 
> The TSC timestamp is automatically calculated using current link
> speed and RX descriptor ring size, such that the sleep time is
> not longer than it would take for a NIC to fill its entire RX
> descriptor ring.
> 
> Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_ethdev/rte_ethdev.c           | 39 +++++++++++++
>  lib/librte_ethdev/rte_ethdev.h           | 70 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_core.h      | 41 +++++++++++++-
>  lib/librte_ethdev/rte_ethdev_version.map |  4 ++
>  4 files changed, 152 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 8e10a6fc36..0be5ecfc11 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -16,6 +16,7 @@
>  #include <netinet/in.h>
> 
>  #include <rte_byteorder.h>
> +#include <rte_cpuflags.h>
>  #include <rte_log.h>
>  #include <rte_debug.h>
>  #include <rte_interrupts.h>
> @@ -5053,6 +5054,44 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
>  	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
>  }
> 
> +int
> +rte_eth_dev_power_mgmt_enable(uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> +		return -ENOTSUP;
> +
> +	/* allocate memory for empty poll stats */
> +	dev->empty_poll_stats = rte_malloc_socket(NULL,
> +		sizeof(struct rte_eth_ep_stat) * RTE_MAX_QUEUES_PER_PORT,
> +		0, dev->data->numa_node);
> +
> +	if (dev->empty_poll_stats == NULL)
> +		return -ENOMEM;
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> +	return 0;
> +}
> +
> +int
> +rte_eth_dev_power_mgmt_disable(uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/* rte_free ignores NULL so safe to call without checks */
> +	rte_free(dev->empty_poll_stats);
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> +	return 0;
> +}
> +
>  /**
>   * A set of values to describe the possible states of a switch domain.
>   */
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index a49242bcd2..b8318f7e91 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -157,6 +157,7 @@ extern "C" {
>  #include <rte_common.h>
>  #include <rte_config.h>
>  #include <rte_ether.h>
> +#include <rte_power_intrinsics.h>
> 
>  #include "rte_ethdev_trace_fp.h"
>  #include "rte_dev_info.h"
> @@ -666,6 +667,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
>  /** Maximum nb. of vlan per mirror rule */
>  #define ETH_MIRROR_MAX_VLANS       64
> 
> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
>  #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
>  #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
>  #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
> @@ -1490,6 +1492,16 @@ enum rte_eth_dev_state {
>  	RTE_ETH_DEV_REMOVED,
>  };
> 
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum rte_eth_dev_power_mgmt_state {
> +	/** Device power management is disabled. */
> +	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	RTE_ETH_DEV_POWER_MGMT_ENABLED
> +};
> +
>  struct rte_eth_dev_sriov {
>  	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
>  	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
> @@ -4302,6 +4314,38 @@ __rte_experimental
>  int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
>  				       struct rte_eth_hairpin_cap *cap);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable device power management.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_eth_dev_power_mgmt_enable(uint16_t port_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Disable device power management.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_eth_dev_power_mgmt_disable(uint16_t port_id);
> +
>  #include <rte_ethdev_core.h>
> 
>  /**
> @@ -4417,6 +4461,32 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>  		} while (cb != NULL);
>  	}
>  #endif
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[queue_id].num++;
> +			if (unlikely(dev->empty_poll_stats[queue_id].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +				volatile void *target_addr;
> +				uint64_t expected, mask;
> +				int ret;
> +
> +				/*
> +				 * get address of next descriptor in the RX
> +				 * ring for this queue, as well as expected
> +				 * value and a mask.
> +				 */
> +				ret = (*dev->dev_ops->next_rx_desc)
> +					(dev->data->rx_queues[queue_id],
> +					 &target_addr, &expected, &mask);

This will crash on every PMD that doesn't implement the next_rx_desc op.
One simple way to avoid that is to check in rte_eth_dev_power_mgmt_enable() that the PMD
does implement ops->next_rx_desc.
That said, I don't think introducing such a new op is the best approach, as it implies
that the PMD has its HW RX descriptors mapped into WB-type memory, and it dictates
to the PMD what it should sleep on.
Depending on HW/SW capabilities and implementation, a PMD might choose to
sleep on a different thing (HW doorbell, SW cond var, etc.).
Another thing: I doubt it is a good idea to pollute the generic RX function with
power-specific code (again, as I said above, it probably wouldn't be that generic for all possible PMDs).
From my perspective we have 2 alternatives to implement such functionality:
1. Keep rte_eth_dev_power_mgmt_enable/disable(port, queue) and move the actual
    *wait_on* code into the PMD RX implementations (we can probably still have some common
    logic about the allowed number of empty polls, max timeout to sleep, etc.).
2. Drop rte_eth_dev_power_mgmt_enable/disable and introduce an explicit
    rte_eth_dev_wait_for_packet(port, queue, timeout) API function.

In both cases the PMD will have full freedom to implement the *wait_on_packet* functionality
in the most convenient way.
For 2) the user would have to do some extra work themselves
(count the number of consecutive empty polls, call the *wait_on_packet* function explicitly).
Though I think that can easily be hidden inside some wrapper API on top
of rte_eth_rx_burst()/rte_eth_dev_wait_for_packet().
Something like rte_eth_rx_burst_wait() or so.
We can have the logic about the allowed number of empty polls, and
maybe some other conditions, in that top-level function.
In that case changes in the user app will still be minimal.
On the other hand, 2) gives the user explicit control over where and when to sleep,
so from my perspective it seems more straightforward and flexible.

> +				if (ret == 0)
> +					/* -1ULL is maximum value for TSC */
> +					rte_power_monitor(target_addr,
> +							  expected, mask,
> +							  0, -1ULL);
> +			}
> +		} else
> +			dev->empty_poll_stats[queue_id].num = 0;
> +	}
> 
>  	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
>  	return nb_rx;
> diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
> index 32407dd418..4e23d465f0 100644
> --- a/lib/librte_ethdev/rte_ethdev_core.h
> +++ b/lib/librte_ethdev/rte_ethdev_core.h
> @@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the next RX ring descriptor address.
> + *
> + * @param rxq
> + *   ethdev queue pointer.
> + * @param tail_desc_addr
> + *   Pointer to the variable that receives the descriptor address.
> + *
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */
> +typedef int (*eth_next_rx_desc_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -752,6 +773,8 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_next_rx_desc_t next_rx_desc;
> +	/**< Get next RX ring descriptor address. */
>  };
> 
>  /**
> @@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
>  	void *param;
>  };
> 
> +/**
> + * @internal
> + * Structure used to hold counters for empty poll
> + */
> +struct rte_eth_ep_stat {
> +	uint64_t num;
> +} __rte_cache_aligned;
> +
>  /**
>   * @internal
>   * The generic data structure associated with each ethernet device.
> @@ -807,8 +838,14 @@ struct rte_eth_dev {
>  	enum rte_eth_dev_state state; /**< Flag indicating the port state */
>  	void *security_ctx; /**< Context for security ops */
> 
> -	uint64_t reserved_64s[4]; /**< Reserved for future fields */
> -	void *reserved_ptrs[4];   /**< Reserved for future fields */
> +	/** Flag indicating the port power state */
> +	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
> +	uint32_t reserved_32;
> +	uint64_t reserved_64s[3]; /**< Reserved for future fields */
> +
> +	/** Empty poll number */
> +	struct rte_eth_ep_stat *empty_poll_stats;
> +	void *reserved_ptrs[3];   /**< Reserved for future fields */
>  } __rte_cache_aligned;
> 
>  struct rte_eth_dev_sriov;
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index 7155056045..141361823d 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -241,4 +241,8 @@ EXPERIMENTAL {
>  	__rte_ethdev_trace_rx_burst;
>  	__rte_ethdev_trace_tx_burst;
>  	rte_flow_get_aged_flows;
> +
> +	# added in 20.08
> +	rte_eth_dev_power_mgmt_disable;
> +	rte_eth_dev_power_mgmt_enable;
>  };
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 11:39   ` Ananyev, Konstantin
@ 2020-05-28 14:40     ` Burakov, Anatoly
  2020-05-28 14:58       ` Bruce Richardson
  2020-05-28 15:38       ` Ananyev, Konstantin
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 14:40 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> Hi Anatoly,
> 
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
> 
> Recently ARM guys introduced new generic API
> for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> Probably would make sense to unite both APIs into something common
> and HW transparent.
> Konstantin

Hi Konstantin,

That's not really the same purpose. This monitors a cacheline for
writes; it does not wait on a specific value. The "expected" value is there
basically as a hack to get around a race condition: by the time you enter
the monitoring state, the write you're waiting for may have already
happened.
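
A minimal sketch of that ordering, with the arm and wait steps passed in as
callbacks so that it builds on any machine. The sketch_* names are
hypothetical and this is not the code from the patch; on x86 the two
callbacks would roughly correspond to UMONITOR and UMWAIT.

```c
#include <stdint.h>

enum sketch_monitor_result {
	SKETCH_SLEPT,	/* the wait step was entered */
	SKETCH_SKIPPED	/* the write had already landed, no sleep */
};

/*
 * Arm first, then re-read the location, and only then sleep. A write that
 * lands before arming is caught by the re-read; a write that lands after
 * arming is caught by the monitoring hardware itself.
 */
static inline enum sketch_monitor_result
sketch_monitor(const volatile uint64_t *p, uint64_t expected, uint64_t mask,
		void (*arm)(const volatile void *), void (*wait)(void))
{
	arm(p);

	/* mask == 0 means "do not check", as in the rte_power_monitor() doc */
	if (mask != 0 && (*p & mask) == expected)
		return SKETCH_SKIPPED;

	wait();
	return SKETCH_SLEPT;
}

/* Stand-ins for the hardware steps, only here to exercise the sketch. */
static unsigned int sketch_wait_calls;

static void
sketch_arm_noop(const volatile void *addr)
{
	(void)addr;
}

static void
sketch_wait_count(void)
{
	sketch_wait_calls++;
}
```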

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 14:40     ` Burakov, Anatoly
@ 2020-05-28 14:58       ` Bruce Richardson
  2020-05-28 15:38       ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: Bruce Richardson @ 2020-05-28 14:58 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Ananyev, Konstantin, dev, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On Thu, May 28, 2020 at 03:40:18PM +0100, Burakov, Anatoly wrote:
> On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> > Hi Anatoly,
> > 
> > > 
> > > Add two new power management intrinsics, and provide an implementation
> > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > are implemented as raw byte opcodes because there is not yet widespread
> > > compiler support for these instructions.
> > > 
> > > The power management instructions provide an architecture-specific
> > > function to either wait until a specified TSC timestamp is reached, or
> > > optionally wait until either a TSC timestamp is reached or a memory
> > > location is written to. The monitor function also provides an optional
> > > comparison, to avoid sleeping when the expected write has already
> > > happened, and no more writes are expected.
> > 
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
> 
> Hi Konstantin,
> 
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. The "expected" value is there as
> basically a hack to get around the race condition due to the fact that by
> the time you enter monitoring state, the write you're waiting for may have
> already happened.
> 
Rather than the "expected" value, is it not more useful for a general API
to check for the existing value? Since we are awaiting writes, we may not
know what new value will be written, but we do know what the value is now,
and can just check that it's not been changed.
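
Something along these lines, as a rough illustration only (the function name
is made up, and the mask handling is an assumption carried over from the
proposed API):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The caller passes the value it last observed; going to sleep is only
 * worthwhile if the monitored bits still hold that value, i.e. nothing
 * has been written in the meantime.
 */
static inline bool
sketch_should_sleep_if_unchanged(const volatile uint64_t *p,
		uint64_t last_seen, uint64_t mask)
{
	return (*p & mask) == (last_seen & mask);
}
```

The difference from the "expected" variant is only the direction of the
test: a match here means "sleep", whereas there it means "do not sleep".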

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 14:40     ` Burakov, Anatoly
  2020-05-28 14:58       ` Bruce Richardson
@ 2020-05-28 15:38       ` Ananyev, Konstantin
  2020-05-29  6:56         ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 15:38 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli


> > Hi Anatoly,
> >
> >>
> >> Add two new power management intrinsics, and provide an implementation
> >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >> are implemented as raw byte opcodes because there is not yet widespread
> >> compiler support for these instructions.
> >>
> >> The power management instructions provide an architecture-specific
> >> function to either wait until a specified TSC timestamp is reached, or
> >> optionally wait until either a TSC timestamp is reached or a memory
> >> location is written to. The monitor function also provides an optional
> >> comparison, to avoid sleeping when the expected write has already
> >> happened, and no more writes are expected.
> >
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
> 
> Hi Konstantin,
> 
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. 

I understand that.

> The "expected" value is there
> as basically a hack to get around the race condition due to the fact
> that by the time you enter monitoring state, the write you're waiting
> for may have already happened.

AFAIK, current rte_wait_until_equal_* does pretty much the same thing:

LDXR memaddr, $reg  // load-exclusive: read the value and arm the monitor
if ($reg != expected_value) {
   SEVL      //     set the local event so that the first WFE falls through
   do {
       WFE     //      wait for an event, e.g. a write to that memory address
       LDXR memaddr, $reg
   } while ($reg != expected_value);
}
 
That looks pretty similar to what rte_power_monitor() does,
except that you don't have a loop checking the new value.
Plus, rte_power_monitor() provides extra options to the user:
a timestamp and the power save mode to enter.
Also, I don't know what the granularity of such events is on ARM:
a cache line, or more/less.
Maybe ARM people can comment/correct me here.
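
For reference, a portable C11 rendering of the loop in the pseudo-code
above. This is only a sketch: the name and the max_spins bound are
assumptions added so the snippet can be tested, it is not the actual
rte_wait_until_equal_32() implementation, and a plain busy re-read stands in
for WFE.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Re-read *addr until it equals 'expected' or max_spins re-reads have
 * failed. Returns true if the expected value was observed.
 */
static inline bool
sketch_wait_until_equal_32(_Atomic uint32_t *addr, uint32_t expected,
		uint64_t max_spins)
{
	uint64_t spins = 0;

	while (atomic_load_explicit(addr, memory_order_acquire) != expected) {
		if (spins++ >= max_spins)
			return false;
		/* on aarch64, WFE would park the core here until an event */
	}
	return true;
}
```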
Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 15:38       ` Ananyev, Konstantin
@ 2020-05-29  6:56         ` Jerin Jacob
  2020-06-02 10:15           ` Ananyev, Konstantin
  2020-06-03  6:22           ` Honnappa Nagarahalli
  0 siblings, 2 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-05-29  6:56 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, Honnappa.Nagarahalli

On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
> > > Hi Anatoly,
> > >
> > >>
> > >> Add two new power management intrinsics, and provide an implementation
> > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > >> are implemented as raw byte opcodes because there is not yet widespread
> > >> compiler support for these instructions.
> > >>
> > >> The power management instructions provide an architecture-specific
> > >> function to either wait until a specified TSC timestamp is reached, or
> > >> optionally wait until either a TSC timestamp is reached or a memory
> > >> location is written to. The monitor function also provides an optional
> > >> comparison, to avoid sleeping when the expected write has already
> > >> happened, and no more writes are expected.
> > >
> > > Recently ARM guys introduced new generic API
> > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > Probably would make sense to unite both APIs into something common
> > > and HW transparent.
> > > Konstantin
> >
> > Hi Konstantin,
> >
> > That's not really similar purpose. This is monitoring a cacheline for
> > writes, not waiting on a specific value.
>
> I understand that.
>
> > The "expected" value is there
> > as basically a hack to get around the race condition due to the fact
> > that by the time you enter monitoring state, the write you're waiting
> > for may have already happened.
>
> AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
>
> LDXR memaddr, $reg  // an address to monitor for
> if ($reg != expected_value)
>    SEVL      //     arm monitor
>    do {
>        WFE     //      waits for write to that memory address
>        LDXR memaddr, $reg
>    } while ($reg != expected_value);
>
> Looks pretty similar to what rte_power_monitor() does,
> except you don't have a loop for checking the new value.
> Plus rte_power_monitor() provides extra options to the user -
> timestamp and power save mode to enter.
> Also I don't know what is the granularity of such events on ARM,
> is it a cache-line or more/less.

As I understand it, the granularity is per cache line,
i.e. a load-exclusive (LDXR) followed by WFE will wait in a low-power
state until the cache line is written.

But I see UMONITOR as a bit different: _without_ another core signaling
it to wake up from the wait state,
it can wake on TSC expiry. I think that is the main primitive of
this feature. Right?

WFE can also wake on timer event stream events (roughly analogous to the
TSC on x86), but there is a configuration
bit, defined by EL1 (the Linux kernel), that controls whether this scheme
is allowed in userspace (EL0).
I am planning to spend time on this after understanding the value
addition of the feature/usecase [1].
[1]
http://mails.dpdk.org/archives/dev/2020-May/168888.html





> Might be ARM people can comment/correct me here.
> Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-29  6:56         ` Jerin Jacob
@ 2020-06-02 10:15           ` Ananyev, Konstantin
  2020-06-03  6:22           ` Honnappa Nagarahalli
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-06-02 10:15 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, Honnappa.Nagarahalli


> 
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an implementation
> > > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > >> are implemented as raw byte opcodes because there is not yet widespread
> > > >> compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an architecture-specific
> > > >> function to either wait until a specified TSC timestamp is reached, or
> > > >> optionally wait until either a TSC timestamp is reached or a memory
> > > >> location is written to. The monitor function also provides an optional
> > > >> comparison, to avoid sleeping when the expected write has already
> > > >> happened, and no more writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API
> > > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline for
> > > writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're waiting
> > > for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg  // an address to monitor for
> > if ($reg != expected_value)
> >    SEVL      //     arm monitor
> >    do {
> >        WFE     //      waits for write to that memory address
> >        LDXR memaddr, $reg
> >    } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does,
> > except you don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM,
> > is it a cache-line or more/less.
> 
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power
> state until the cache line is written.
> 
> But I see UMONITOR bit different, Where _without_ other core signaling
> to wakeup from wait state,
> it can wake on TSC expiry. I think, that's is the main primitive on
> this feature. Right?
> 
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
> I am planning to spend time on this after understanding the value
> addition of the feature/usecase[1]
> [1]
> http://mails.dpdk.org/archives/dev/2020-May/168888.html
> 

OK, if there is a consensus to keep these two APIs separate for now,
I won't insist.
Konstantin


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-29  6:56         ` Jerin Jacob
  2020-06-02 10:15           ` Ananyev, Konstantin
@ 2020-06-03  6:22           ` Honnappa Nagarahalli
  2020-06-03  6:31             ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Honnappa Nagarahalli @ 2020-06-03  6:22 UTC (permalink / raw)
  To: Jerin Jacob, Ananyev, Konstantin
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, nd, nd

<snip>

> 
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an
> > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > >> The instructions are implemented as raw byte opcodes because
> > > >> there is not yet widespread compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an
> > > >> architecture-specific function to either wait until a specified
> > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > >> timestamp is reached or a memory location is written to. The
> > > >> monitor function also provides an optional comparison, to avoid
> > > >> sleeping when the expected write has already happened, and no more
> writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API for similar (as I
> > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline
> > > for writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're
> > > waiting for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg  // an address to monitor for if ($reg !=
> > expected_value)
> >    SEVL      //     arm monitor
> >    do {
> >        WFE     //      waits for write to that memory address
> >        LDXR memaddr, $reg
> >    } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does, except you
> > don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM, is it
> > a cache-line or more/less.
> 
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> the cache line is written.
The architecture allows the monitored region to be anywhere from 16B to 2048B. Typically, implementations use cache-line granularity.

> 
> But I see UMONITOR bit different, Where _without_ other core signaling to
> wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> primitive on this feature. Right?
> 
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
Timer stream events are not per CPU core. They are system wide streams.

> I am planning to spend time on this after understanding the value addition of
> the feature/usecase[1] [1] http://mails.dpdk.org/archives/dev/2020-
> May/168888.html
> 
> 
> 
> 
> 
> > Might be ARM people can comment/correct me here.
> > Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-06-03  6:22           ` Honnappa Nagarahalli
@ 2020-06-03  6:31             ` Jerin Jacob
  0 siblings, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-06-03  6:31 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Ananyev, Konstantin, Burakov, Anatoly, dev, Richardson, Bruce,
	Hunt, David, Ma, Liang J, nd

On Wed, Jun 3, 2020 at 11:53 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> >
> > On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > >
> > >
> > > > > Hi Anatoly,
> > > > >
> > > > >>
> > > > >> Add two new power management intrinsics, and provide an
> > > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > > >> The instructions are implemented as raw byte opcodes because
> > > > >> there is not yet widespread compiler support for these instructions.
> > > > >>
> > > > >> The power management instructions provide an
> > > > >> architecture-specific function to either wait until a specified
> > > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > > >> timestamp is reached or a memory location is written to. The
> > > > >> monitor function also provides an optional comparison, to avoid
> > > > >> sleeping when the expected write has already happened, and no more
> > writes are expected.
> > > > >
> > > > > Recently ARM guys introduced new generic API for similar (as I
> > > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > > Probably would make sense to unite both APIs into something common
> > > > > and HW transparent.
> > > > > Konstantin
> > > >
> > > > Hi Konstantin,
> > > >
> > > > That's not really similar purpose. This is monitoring a cacheline
> > > > for writes, not waiting on a specific value.
> > >
> > > I understand that.
> > >
> > > > The "expected" value is there
> > > > as basically a hack to get around the race condition due to the fact
> > > > that by the time you enter monitoring state, the write you're
> > > > waiting for may have already happened.
> > >
> > > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> > >
> > > LDXR memaddr, $reg  // an address to monitor for if ($reg !=
> > > expected_value)
> > >    SEVL      //     arm monitor
> > >    do {
> > >        WFE     //      waits for write to that memory address
> > >        LDXR memaddr, $reg
> > >    } while ($reg != expected_value);
> > >
> > > Looks pretty similar to what rte_power_monitor() does, except you
> > > don't have a loop for checking the new value.
> > > Plus rte_power_monitor() provides extra options to the user -
> > > timestamp and power save mode to enter.
> > > Also I don't know what is the granularity of such events on ARM, is it
> > > a cache-line or more/less.
> >
> > As I understand it, Granularity is per the cache-line.
> > ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> > the cache line is written.
> Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.
>
> >
> > But I see UMONITOR bit different, Where _without_ other core signaling to
> > wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> > primitive on this feature. Right?
> >
> > WFE can also wake based on Timer stream events(kind of TSC in x86
> > analogy) but it has a configuration
> > bit that needs to allow for this scheme in userspace(EL0) or not?
> > defined by EL1(Linux kernel).
> Timer stream events are not per CPU core. They are system wide streams.

We may not need per-core support to implement this use case.

I think the kernel is currently configured to generate a WFE wake-up event
every 100us (system-wide).

A do {} while loop can check after each WFE whether the requested timestamp
has passed.
But the minimum granularity will be 100us (i.e. 100us worth of ticks).
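
A rough sketch of that loop, with a fake clock in place of the real counter
so that it runs anywhere. All names are hypothetical; each loop iteration
stands for one WFE that returns on the next event-stream tick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fake time source: 'tick' is the wake-up period (e.g. 100us of ticks). */
struct sketch_clock {
	uint64_t now;
	uint64_t tick;
};

/*
 * Wait until *p == expected or the deadline has passed. Time only moves
 * in whole ticks, so the deadline can be overshot by up to one tick.
 * Returns true if the expected value was observed.
 */
static inline bool
sketch_wait_until_deadline(const volatile uint64_t *p, uint64_t expected,
		uint64_t deadline, struct sketch_clock *clk)
{
	do {
		if (*p == expected)
			return true;
		/* stands in for WFE returning on the next system-wide event */
		clk->now += clk->tick;
	} while (clk->now < deadline);

	return *p == expected;
}
```

With tick = 100 and deadline = 250, the loop only gives up once the clock
reads 300, which is the coarse granularity mentioned above.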


>
> > I am planning to spend time on this after understanding the value addition of
> > the feature/usecase[1] [1] http://mails.dpdk.org/archives/dev/2020-
> > May/168888.html
> >
> >
> >
> >
> >
> > > Might be ARM people can comment/correct me here.
> > > Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (6 preceding siblings ...)
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
@ 2020-08-11 10:27 ` Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
                     ` (6 more replies)
  7 siblings, 7 replies; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 138 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 207 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 000000000..8646c4ac1
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd0902795..3a12e87e1 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2f..494a8142a 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d..94d6a4376 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
 	RTE_CPUFLAG_EM64T,                  /**< EM64T */
 
+	RTE_CPUFLAG_WAITPKG,                 /**< UMONITOR/UMWAIT/TPAUSE */
 	/* (EAX 80000007h) EDX features */
 	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
 
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 000000000..af8aa9459
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+	rte_mb();
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e795..0325c4b93 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-13 18:11     ` Liang, Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API Liang Ma
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Add a simple on/off switch that enables power saving when no
packets are arriving. It counts the number of empty polls and,
when that number reaches a certain threshold, enters an
architecture-defined optimized power state that lasts until
either a TSC timestamp expires or packets arrive.

This API is limited to the 1 core, 1 queue use case, as there is
no coordination between queues/cores in ethdev.

This design leverages the RX callback mechanism, which allows
three different power management methods to coexist.

1. UMWAIT/UMONITOR

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. PAUSE instruction

   Instead of moving the core into a deeper C-state, this
   lightweight method uses the PAUSE instruction to relieve the
   processor from busy polling.

3. Frequency scaling

   Reuse the existing rte_power library to scale the core
   frequency up or down depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 config/common_base                       |   4 +-
 lib/Makefile                             |   1 +
 lib/librte_ethdev/Makefile               |   2 +-
 lib/librte_ethdev/meson.build            |   2 +-
 lib/librte_ethdev/rte_ethdev.c           | 198 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           |  59 +++++++
 lib/librte_ethdev/rte_ethdev_core.h      |  43 ++++-
 lib/librte_ethdev/rte_ethdev_version.map |   4 +
 lib/meson.build                          |   5 +-
 mk/rte.app.mk                            |   2 +-
 10 files changed, 311 insertions(+), 9 deletions(-)

diff --git a/config/common_base b/config/common_base
index f76585f16..e0948f0cb 100644
--- a/config/common_base
+++ b/config/common_base
@@ -155,7 +155,7 @@ CONFIG_RTE_MAX_ETHPORTS=32
 CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
 CONFIG_RTE_LIBRTE_IEEE1588=n
 CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
-CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
+CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=n
 CONFIG_RTE_ETHDEV_PROFILE_WITH_VTUNE=n
 
 #
@@ -978,7 +978,7 @@ CONFIG_RTE_LIBRTE_ACL_DEBUG=n
 #
 # Compile librte_power
 #
-CONFIG_RTE_LIBRTE_POWER=n
+CONFIG_RTE_LIBRTE_POWER=y
 CONFIG_RTE_LIBRTE_POWER_DEBUG=n
 CONFIG_RTE_MAX_LCORE_FREQS=64
 
diff --git a/lib/Makefile b/lib/Makefile
index 8f5b68a2d..87646698a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,6 +28,7 @@ DEPDIRS-librte_ethdev := librte_net librte_eal librte_mempool librte_ring
 DEPDIRS-librte_ethdev += librte_mbuf
 DEPDIRS-librte_ethdev += librte_kvargs
 DEPDIRS-librte_ethdev += librte_meter
+DEPDIRS-librte_ethdev += librte_power
 DIRS-$(CONFIG_RTE_LIBRTE_BBDEV) += librte_bbdev
 DEPDIRS-librte_bbdev := librte_eal librte_mempool librte_mbuf
 DIRS-$(CONFIG_RTE_LIBRTE_CRYPTODEV) += librte_cryptodev
diff --git a/lib/librte_ethdev/Makefile b/lib/librte_ethdev/Makefile
index 47747150b..6a4ce14cf 100644
--- a/lib/librte_ethdev/Makefile
+++ b/lib/librte_ethdev/Makefile
@@ -11,7 +11,7 @@ LIB = librte_ethdev.a
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_net -lrte_eal -lrte_mempool -lrte_ring
-LDLIBS += -lrte_mbuf -lrte_kvargs -lrte_meter -lrte_telemetry
+LDLIBS += -lrte_mbuf -lrte_kvargs -lrte_meter -lrte_telemetry -lrte_power
 
 EXPORT_MAP := rte_ethdev_version.map
 
diff --git a/lib/librte_ethdev/meson.build b/lib/librte_ethdev/meson.build
index 8fc24e8c8..e09e2395e 100644
--- a/lib/librte_ethdev/meson.build
+++ b/lib/librte_ethdev/meson.build
@@ -27,4 +27,4 @@ headers = files('rte_ethdev.h',
 	'rte_tm.h',
 	'rte_tm_driver.h')
 
-deps += ['net', 'kvargs', 'meter', 'telemetry']
+deps += ['net', 'kvargs', 'meter', 'telemetry', 'power']
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 7858ad5f1..b43de88ce 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -16,6 +16,7 @@
 #include <netinet/in.h>
 
 #include <rte_byteorder.h>
+#include <rte_cpuflags.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_interrupts.h>
@@ -39,6 +40,7 @@
 #include <rte_class.h>
 #include <rte_ether.h>
 #include <rte_telemetry.h>
+#include <rte_power.h>
 
 #include "rte_ethdev_trace.h"
 #include "rte_ethdev.h"
@@ -185,6 +187,100 @@ enum {
 	STAT_QMAP_RX
 };
 
+
+static uint16_t
+rte_ethdev_pmgmt_umait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[qidx].num++;
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+				volatile void *target_addr;
+				uint64_t expected, mask;
+				int ret;
+
+				/*
+				 * get address of next descriptor in the RX
+				 * ring for this queue, as well as expected
+				 * value and a mask.
+				 */
+				ret = (*dev->dev_ops->next_rx_desc)
+					(dev->data->rx_queues[qidx],
+					 &target_addr, &expected, &mask);
+				if (ret == 0)
+					/* -1ULL is maximum value for TSC */
+					rte_power_monitor(target_addr,
+							  expected, mask,
+							  0, -1ULL);
+			}
+		} else
+			dev->empty_poll_stats[qidx].num = 0;
+	}
+
+	return 0;
+}
+
+static uint16_t
+rte_ethdev_pmgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+
+			dev->empty_poll_stats[qidx].num++;
+
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+
+				for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+					rte_pause();
+
+			}
+		} else
+			dev->empty_poll_stats[qidx].num = 0;
+	}
+
+	return 0;
+}
+
+static uint16_t
+rte_ethdev_pmgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[qidx].num++;
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+
+				/* scale down freq */
+				rte_power_freq_min(rte_lcore_id());
+
+			}
+		} else {
+			dev->empty_poll_stats[qidx].num = 0;
+			/* scale up freq */
+			rte_power_freq_max(rte_lcore_id());
+		}
+	}
+
+	return 0;
+}
+
 int
 rte_eth_iterator_init(struct rte_dev_iterator *iter, const char *devargs_str)
 {
@@ -5113,6 +5209,108 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
 	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
 }
 
+int
+rte_eth_dev_power_mgmt_enable(unsigned int lcore_id,
+			      uint16_t port_id,
+			 enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* allocate zeroed memory for empty poll stats */
+	dev->empty_poll_stats = rte_zmalloc_socket(NULL,
+						  sizeof(struct rte_eth_ep_stat)
+						  * RTE_MAX_QUEUES_PER_PORT,
+						  0, dev->data->numa_node);
+
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	dev->cb_mode = mode;
+
+	switch (mode) {
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT:
+
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_ethdev_pmgmt_umait, NULL);
+		break;
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_ethdev_pmgmt_scalefreq, NULL);
+		break;
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_ethdev_pmgmt_pause, NULL);
+		break;
+
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_eth_dev_power_mgmt_disable(unsigned int lcore_id,
+			       uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/*add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)  {
+		/* rte_free ignores NULL so safe to call without checks */
+		rte_free(dev->empty_poll_stats);
+
+		switch (dev->cb_mode) {
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT:
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+
+			rte_eth_remove_rx_callback(port_id, 0,
+						   dev->cur_pwr_cb);
+
+			break;
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+
+			rte_power_freq_max(lcore_id);
+
+			rte_eth_remove_rx_callback(port_id, 0,
+						   dev->cur_pwr_cb);
+
+			if (rte_power_exit(lcore_id))
+				return -EINVAL;
+
+			break;
+		}
+
+		dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+
+	}
+	return 0;
+}
+
 /**
  * A set of values to describe the possible states of a switch domain.
  */
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 57e4a6ca5..6858c0338 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1603,6 +1605,25 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** Wait for RX traffic using UMONITOR/UMWAIT. */
+	RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT = 0,
+	/** Relieve busy polling using the PAUSE instruction. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
@@ -4415,6 +4436,40 @@ __rte_experimental
 int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
 				       struct rte_eth_hairpin_cap *cap);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 #include <rte_ethdev_core.h>
 
 /**
@@ -4535,6 +4590,7 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 	return nb_rx;
 }
 
+
 /**
  * Get the number of used descriptors of a rx queue
  *
@@ -4993,6 +5049,9 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
 	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
 }
 
+
+
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd41..7d6d85ddc 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to the variable that receives the descriptor address.
+ *
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +773,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +838,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/** Port power management state and selected callback mode */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint32_t reserved_32;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/** Per-queue empty poll counters */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[3];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 1212a17d3..4d5b63a5b 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -241,6 +241,10 @@ EXPERIMENTAL {
 	__rte_ethdev_trace_rx_burst;
 	__rte_ethdev_trace_tx_burst;
 	rte_flow_get_aged_flows;
+
+	# added in 20.08
+	rte_eth_dev_power_mgmt_disable;
+	rte_eth_dev_power_mgmt_enable;
 };
 
 INTERNAL {
diff --git a/lib/meson.build b/lib/meson.build
index 3852c0156..54cc0db7d 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,17 +14,18 @@ libraries = [
 	'eal', # everything depends on eal
 	'ring',
 	'rcu', # rcu depends on ring
+	'timer',   # eventdev depends on this
+	'power',   # ethdev depends on this
 	'mempool', 'mbuf', 'net', 'meter', 'ethdev', 'pci', # core
 	'cmdline',
 	'metrics', # bitrate/latency stats depends on this
 	'hash',    # efd depends on this
-	'timer',   # eventdev depends on this
 	'acl', 'bbdev', 'bitratestats', 'cfgfile',
 	'compressdev', 'cryptodev',
 	'distributor', 'efd', 'eventdev',
 	'gro', 'gso', 'ip_frag', 'jobstats',
 	'kni', 'latencystats', 'lpm', 'member',
-	'power', 'pdump', 'rawdev', 'regexdev',
+	'pdump', 'rawdev', 'regexdev',
 	'rib', 'reorder', 'sched', 'security', 'stack', 'vhost',
 	# ipsec lib depends on net, crypto and security
 	'ipsec',
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index a54425997..b87abb26e 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -58,7 +58,6 @@ endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METRICS)        += --no-whole-archive
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BITRATE)        += -lrte_bitratestats
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LATENCY_STATS)  += -lrte_latencystats
-_LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)          += -lrte_power
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_EFD)            += -lrte_efd
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BPF)            += -lrte_bpf
@@ -80,6 +79,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_KVARGS)         += -lrte_kvargs
 _LDLIBS-y                                   += -lrte_telemetry
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MBUF)           += -lrte_mbuf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_NET)            += -lrte_net
+_LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)          += -lrte_power
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ETHER)          += -lrte_ethdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BBDEV)          += -lrte_bbdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CRYPTODEV)      += -lrte_cryptodev
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 4/5] net/i40e: " Liang Ma
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e..618fc1573 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf513..d1d015dea 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b2..826f451be 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 4/5] net/i40e: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 5/5] net/ice: " Liang Ma
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 05d5f2861..f0797c3cb 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -515,6 +515,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c..9d7eea8ae 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160..bfda5b6ad 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 5/5] net/ice: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (2 preceding siblings ...)
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 4/5] net/i40e: " Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-13 18:04   ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang, Ma
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 7dd3fcd27..7a636cd11 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index cc3139042..ce7e025b6 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d0..7eb6fa904 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (3 preceding siblings ...)
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 5/5] net/ice: " Liang Ma
@ 2020-08-13 18:04   ` Liang, Ma
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  6 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-08-13 18:04 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

On 11 Aug 11:27, Liang Ma wrote:
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> --- 
<snip> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d..94d6a4376 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
>  	RTE_CPUFLAG_EM64T,                  /**< EM64T */
>  
> +	RTE_CPUFLAG_WAITPKG,                 /**< UMONITOR/UMWAIT/TPAUSE */
This needs to be reordered to avoid breaking the ABI.
>  	/* (EAX 80000007h) EDX features */
>  	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
>  
</snip>  
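
To spell out the ABI concern: adding an enumerator in the middle of an
enum renumbers every enumerator after it, so binaries built against the
old header would pass stale values. A minimal sketch with hypothetical,
shortened flag lists (not the real rte_cpu_flag_t):

```c
#include <assert.h>

/* Hypothetical "old" flag list an application was compiled against. */
enum old_flags { OLD_RDTSCP, OLD_EM64T, OLD_INVTSC, OLD_NUMFLAGS };

/* New flag inserted in the middle: every later value shifts by one. */
enum mid_flags { MID_RDTSCP, MID_EM64T, MID_WAITPKG, MID_INVTSC,
		 MID_NUMFLAGS };

/* New flag appended before the terminator: existing values are unchanged. */
enum end_flags { END_RDTSCP, END_EM64T, END_INVTSC, END_WAITPKG,
		 END_NUMFLAGS };

/* Compile-time check that appending keeps the old numbering. */
static_assert((int)END_INVTSC == (int)OLD_INVTSC,
	      "appending must not renumber existing flags");

/* Returns 1 if INVTSC changed value relative to the old header. */
static int invtsc_moved_mid(void) { return (int)MID_INVTSC != (int)OLD_INVTSC; }
static int invtsc_moved_end(void) { return (int)END_INVTSC != (int)OLD_INVTSC; }
```

Only the terminator (`*_NUMFLAGS`) moves when the flag is appended,
which is why new flags go just before "the last item".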
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
@ 2020-08-13 18:11     ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-08-13 18:11 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

On 11 Aug 11:27, Liang Ma wrote:
<snip>
> +static uint16_t
> +rte_ethdev_pmgmt_umait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[qidx].num++;
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +				volatile void *target_addr;
> +				uint64_t expected, mask;
> +				uint16_t ret;
> +
> +				/*
> +				 * get address of next descriptor in the RX
> +				 * ring for this queue, as well as expected
> +				 * value and a mask.
> +				 */
> +				ret = (*dev->dev_ops->next_rx_desc)
> +					(dev->data->rx_queues[qidx],
> +					 &target_addr, &expected, &mask);
> +				if (ret == 0)
> +					/* -1ULL is maximum value for TSC */
> +					rte_power_monitor(target_addr,
> +							  expected, mask,
> +							  0, -1ULL);
> +			}
> +		} else
> +			dev->empty_poll_stats[qidx].num = 0;
> +	}
> +
> +	return 0;
This should return nb_rx here; that is fixed in v3.
> +}
> +
> +static uint16_t
> +rte_ethdev_pmgmt_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	int i;
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +
> +			dev->empty_poll_stats[qidx].num++;
> +
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +
> +				for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> +					rte_pause();
> +
> +			}
> +		} else
> +			dev->empty_poll_stats[qidx].num = 0;
> +	}
> +
> +	return 0;
This should return nb_rx here; that is fixed in v3.
> +}
> +
> +static uint16_t
> +rte_ethdev_pmgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[qidx].num++;
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +
> +				/*scale down freq */
> +				rte_power_freq_min(rte_lcore_id());
> +
> +			}
> +		} else {
> +			dev->empty_poll_stats[qidx].num = 0;
> +			/* scal up freq */
> +			rte_power_freq_max(rte_lcore_id());
> +		}
> +	}
> +
> +	return 0;
This should return nb_rx here; that is fixed in v3.
> +}
> +
</snip>
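
For context on why the return value matters: the value returned by a
post-RX callback becomes the packet count handed to the next callback
and, ultimately, to the application. Returning 0 therefore makes every
received burst look empty. The sketch below models that chain with a
simplified, hypothetical callback type (the real callbacks take more
arguments):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified post-RX callback: takes the packet count
 * produced so far and returns the count to pass on. */
typedef uint16_t (*mock_rx_cb_t)(uint16_t nb_rx);

/* Mirrors the v2 behaviour: bookkeeping only, count is discarded. */
static uint16_t cb_returns_zero(uint16_t nb_rx) { (void)nb_rx; return 0; }

/* Mirrors the v3 fix: the count is passed through untouched. */
static uint16_t cb_returns_nb_rx(uint16_t nb_rx) { return nb_rx; }

/* Run the driver's burst result through each callback in order. */
static uint16_t
mock_rx_burst(uint16_t nb_from_driver, const mock_rx_cb_t *cbs, size_t n_cbs)
{
	uint16_t nb_rx = nb_from_driver;
	size_t i;

	for (i = 0; i < n_cbs; i++) {
		assert(cbs[i] != NULL);
		nb_rx = cbs[i](nb_rx);
	}
	return nb_rx;
}
```

With `cb_returns_zero` installed, a burst of 32 packets is reported to
the caller as 0 packets; with `cb_returns_nb_rx` it is reported as 32.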

 -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 1/6] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (4 preceding siblings ...)
  2020-08-13 18:04   ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang, Ma
@ 2020-09-03 16:06   ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
                       ` (4 more replies)
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  6 siblings, 5 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:06 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.
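
The optional comparison can be sketched in portable C as follows. This
only models the decision made before sleeping; it does not execute
UMONITOR/UMWAIT, and the `monitor_precheck` name and `monitor_action`
enum are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum monitor_action {
	MONITOR_SKIP_SLEEP,	/* expected write already happened */
	MONITOR_SLEEP		/* safe to enter the optimized power state */
};

/* Decide whether sleeping makes sense. A zero mask disables the check,
 * matching the "optional comparison" described above. */
static enum monitor_action
monitor_precheck(const volatile uint64_t *p, uint64_t expected_value,
		uint64_t value_mask)
{
	assert(p != NULL);

	if (value_mask != 0) {
		const uint64_t cur = *p;

		/* if the masked value already matches, do not sleep */
		if ((cur & value_mask) == expected_value)
			return MONITOR_SKIP_SLEEP;
	}
	return MONITOR_SLEEP;
}
```

In the patch the check sits between arming the monitor and executing
the wait, so a write that lands before the wait is not missed.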

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   2 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 213 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/**< UMWAIT/TPAUSE Instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback Liang Ma
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple API that allows ethdev to get the last available
queue descriptor address from the PMD.
Also include the related internal structure updates.
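
A minimal sketch of how such a per-driver hook can be dispatched is
shown below. All `mock_*` names are hypothetical, and the NULL check
with -ENOTSUP is an assumption about how a caller might guard against
drivers that do not provide the hook; it is not taken from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same signature as the eth_next_rx_desc_t typedef in this patch. */
typedef int (*mock_next_rx_desc_t)(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask);

/* Hypothetical, cut-down ops table with only the new hook. */
struct mock_dev_ops {
	mock_next_rx_desc_t next_rx_desc;
};

static uint64_t mock_status_word;	/* stands in for a descriptor field */

/* Hypothetical PMD implementation of the hook. */
static int
mock_pmd_next_rx_desc(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask)
{
	(void)rxq;
	*tail_desc_addr = &mock_status_word;
	*expected = 1;
	*mask = 1;
	return 0;
}

/* Dispatch through the ops table, refusing drivers without the hook. */
static int
mock_get_wake_addr(const struct mock_dev_ops *ops, void *rxq,
		volatile void **addr, uint64_t *expected, uint64_t *mask)
{
	assert(addr != NULL && expected != NULL && mask != NULL);

	if (ops == NULL || ops->next_rx_desc == NULL)
		return -ENOTSUP;
	return ops->next_rx_desc(rxq, addr, expected, mask);
}
```

Guarding the call this way would let the generic layer report an error
for PMDs that have not implemented the hook yet.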

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
 2 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 70295d7ab7..d9312d3e11 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** WAIT callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..16e54bb4e4 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to the variable that receives the descriptor address.
+ * @param expected
+ *   Pointer to the value expected at that address once the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +776,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +841,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/**< Flag indicating the port power state */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	/**< Power mgmt Callback mode */
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/**< Empty poll number */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[2];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait until
either a TSC timestamp expires or packets arrive.

This API is limited to the 1 core, 1 port, 1 queue use case, as there
is no coordination between queues/cores in ethdev. Mapping 1 port to
multiple cores will be supported in the next version.

This design leverages the RX callback mechanism, which allows three
different power management methods to coexist.

1. UMWAIT/UMONITOR

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. PAUSE instruction

   Instead of moving the core into a deeper C-state, this lightweight
   method uses the PAUSE instruction to relieve the processor from
   busy polling.

3. Frequency scaling

   Reuse the existing rte_power library to scale the core frequency
   up or down depending on traffic volume.
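
The empty-poll counting shared by all three methods can be sketched as
a small state machine. The names below are hypothetical; only the
threshold value (512) and the "reset on any non-empty burst" rule are
taken from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512	/* same value as ETH_EMPTYPOLL_MAX */

/* Per-queue empty poll counter, modelled on struct rte_eth_ep_stat. */
struct ep_stat {
	uint64_t num;
};

enum poll_action {
	POLL_KEEP_SPINNING,	/* below threshold, or traffic seen */
	POLL_SAVE_POWER		/* threshold exceeded: wait, pause or scale */
};

/* Update the counter for one RX burst and report what to do next. */
static enum poll_action
on_rx_burst(struct ep_stat *st, uint16_t nb_rx)
{
	assert(st != NULL);

	if (nb_rx != 0) {
		/* any traffic resets the counter */
		st->num = 0;
		return POLL_KEEP_SPINNING;
	}
	st->num++;
	return st->num > EMPTYPOLL_MAX ? POLL_SAVE_POWER : POLL_KEEP_SPINNING;
}
```

The first 512 consecutive empty polls keep spinning; the 513th and
every later empty poll request power saving until a packet arrives.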

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_power/meson.build           |   3 +-
 lib/librte_power/rte_power.h           |  38 +++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
 lib/librte_power/rte_power_version.map |   4 +
 4 files changed, 228 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..44b01afce2 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
 headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
index bbbde4dfb4..06d5a9984f 100644
--- a/lib/librte_power/rte_power.h
+++ b/lib/librte_power/rte_power.h
@@ -14,6 +14,7 @@
 #include <rte_byteorder.h>
 #include <rte_log.h>
 #include <rte_string_fns.h>
+#include <rte_ethdev.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
  */
 int rte_power_exit(unsigned int lcore_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function mode.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 /**
  * Get the available frequencies of a specific lcore.
  * Function pointer definition. Review each environments
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..a445153ede
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+
+#include "rte_power.h"
+
+
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			uint16_t ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = (*dev->dev_ops->next_rx_desc)
+				(dev->data->rx_queues[qidx],
+				 &target_addr, &expected, &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr,
+						  expected, mask,
+						  0, -1ULL);
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (unlikely(nb_rx == 0)) {
+
+		dev->empty_poll_stats[qidx].num++;
+
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+				rte_pause();
+
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		dev->empty_poll_stats[qidx].num = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+			uint16_t port_id,
+			enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+	/* allocate zeroed memory so the empty poll counters start at 0 */
+	dev->empty_poll_stats = rte_zmalloc_socket(NULL,
+						   sizeof(struct rte_eth_ep_stat)
+						   * RTE_MAX_QUEUES_PER_PORT,
+						   0, dev->data->numa_node);
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	switch (mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+
+	dev->cb_mode = mode;
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_power_pmd_mgmt_disable(unsigned int lcore_id,
+				uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
+		return -EINVAL;
+
+	/* rte_free ignores NULL so safe to call without checks */
+	rte_free(dev->empty_poll_stats);
+
+	switch (dev->cb_mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		if (rte_power_exit(lcore_id))
+			return -EINVAL;
+		break;
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	dev->cur_pwr_cb = NULL;
+	dev->cb_mode = 0;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 00ee5753e2..ade83cfd4f 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.08
+	rte_power_pmd_mgmt_disable;
+	rte_power_pmd_mgmt_enable;
+
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 6/6] net/ice: " Liang Ma
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.
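
As a rough illustration of the callback contract (this is not the ixgbe
code itself), the sketch below fills in the three outputs for a made-up
ring. The names `toy_rx_desc`, `toy_rx_queue`, `TOY_STAT_DD` and
`toy_next_rx_desc` are hypothetical, and byte-order conversion is left out
for brevity.

```c
#include <stdint.h>

#define TOY_RING_SIZE 4
#define TOY_STAT_DD   0x1u	/* hypothetical "descriptor done" bit */

/* Hypothetical write-back descriptor; real layouts are driver-specific. */
struct toy_rx_desc {
	uint64_t pkt_addr;
	uint32_t status_error;
	uint32_t length;
};

struct toy_rx_queue {
	struct toy_rx_desc ring[TOY_RING_SIZE];
	uint16_t rx_tail;	/* index of the next descriptor to be filled */
};

/*
 * Same shape as the next_rx_desc callback: report which address to watch
 * and which masked value means "this descriptor was already written".
 */
static int
toy_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask)
{
	struct toy_rx_queue *rxq = rx_queue;

	*tail_desc_addr = &rxq->ring[rxq->rx_tail].status_error;
	*expected = TOY_STAT_DD;
	*mask = TOY_STAT_DD;

	return 0;
}
```

The caller never needs to know the descriptor layout: it only receives an
address, an expected value and a mask.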

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e2..618fc15732 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..d1d015deae 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..826f451bee 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
                       ` (2 preceding siblings ...)
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 6/6] net/ice: " Liang Ma
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 11c02b1888..94e9298d7c 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -517,6 +517,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c1..9d7eea8aed 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..bfda5b6ad3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 6/6] net/ice: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
                       ` (3 preceding siblings ...)
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: " Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 8d435e8892..7d7e1dcbac 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 2e1f06d2c0..c043181ceb 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (5 preceding siblings ...)
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
@ 2020-09-04 10:18   ` Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
                       ` (10 more replies)
  6 siblings, 11 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.
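
The optional comparison can be pictured with the small model below. It is
only a sketch of the check described above, not the intrinsic itself; the
helper name `monitor_should_abort` is hypothetical.

```c
#include <stdint.h>

/*
 * Return 1 if the sleep should be skipped because the masked value at `p`
 * already equals `expected_value`, 0 if the caller should go on to wait.
 * A zero mask disables the comparison.
 */
static inline int
monitor_should_abort(const volatile uint64_t *p, uint64_t expected_value,
		uint64_t value_mask)
{
	if (value_mask == 0)
		return 0;

	return (*p & value_mask) == expected_value;
}
```

In the x86 implementation in this patch, the equivalent check runs after
the monitor has been armed and before the wait instruction, so a write that
lands between the two steps is not missed.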

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   2 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 213 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/**< UMWAIT/TPAUSE Instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 16:37       ` Stephen Hemminger
  2020-09-04 20:54       ` Ananyev, Konstantin
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
                       ` (9 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple API that allows ethdev to get the address of the next
available RX queue descriptor from the PMD.
Also update the internal structures accordingly.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
 2 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 70295d7ab7..d9312d3e11 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** WAIT callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..16e54bb4e4 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to the variable that receives the descriptor address.
+ * @param expected
+ *   Pointer to the value expected once the descriptor has been written.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +776,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +841,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/** Flag indicating the port power management state */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	/** Power management callback mode */
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/** Per-queue empty poll counters */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[2];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 16:36       ` Stephen Hemminger
  2020-09-04 18:33       ` Ananyev, Konstantin
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
                       ` (8 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either
a TSC timestamp expires or packets arrive.
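
The counting scheme can be modelled as below. This is a simplified sketch
rather than the callback code from this patch: `ep_state` and `ep_on_rx`
are hypothetical names, and the `sleeps` counter stands in for actually
entering the power-optimized state.

```c
#include <stdint.h>

#define EMPTYPOLL_MAX 512	/* mirrors ETH_EMPTYPOLL_MAX in this series */

struct ep_state {
	uint64_t empty_polls;	/* consecutive polls that returned nothing */
	uint64_t sleeps;	/* stand-in for "entered the power state" */
};

/* Called after every RX burst with the number of packets received. */
static uint16_t
ep_on_rx(struct ep_state *st, uint16_t nb_rx)
{
	if (nb_rx == 0) {
		st->empty_polls++;
		if (st->empty_polls > EMPTYPOLL_MAX)
			st->sleeps++;	/* real code would wait here */
	} else {
		/* traffic is flowing again, start counting from scratch */
		st->empty_polls = 0;
	}

	return nb_rx;
}
```

Because the counter is reset by any non-empty burst, the wait is only
reached after more than EMPTYPOLL_MAX consecutive empty polls.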

This API is limited to the 1 core, 1 port, 1 queue use case, as there is
no coordination between queues/cores in ethdev. Mapping 1 port to multiple
cores will be supported in the next version.

This design leverages the RX callback mechanism, which allows three
different power management methodologies to coexist.

1. umwait/umonitor:

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. Pause instruction:

   Instead of moving the core into a deeper C-state, this lightweight
   method uses the PAUSE instruction to relieve the processor from
   busy polling.

3. Frequency scaling:

   Reuse the existing rte_power library to scale the core frequency
   up or down depending on traffic volume.
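
For method 1, the timeout bound could be estimated along the lines of the
sketch below. Note that the callback in this patch passes -1ULL as the TSC
timestamp, so this calculation is an assumption about how the bound might
be derived, not code taken from the series. The helper name
`ring_fill_tsc_cycles` is hypothetical, and the worst case is taken to be
minimum-size Ethernet frames (64 bytes plus 20 bytes of preamble and
inter-frame gap) arriving at line rate.

```c
#include <stdint.h>

/* (64-byte frame + 20 bytes preamble/inter-frame gap) * 8 bits */
#define WIRE_BITS_MIN_FRAME 672u

/*
 * Number of TSC cycles a NIC needs to fill `nb_desc` descriptors with
 * minimum-size frames at `link_mbps` megabits per second.
 */
static uint64_t
ring_fill_tsc_cycles(uint16_t nb_desc, uint32_t link_mbps, uint64_t tsc_hz)
{
	uint64_t bits = (uint64_t)nb_desc * WIRE_BITS_MIN_FRAME;

	if (link_mbps == 0)
		return 0;	/* link down or speed unknown: do not sleep */

	return bits * tsc_hz / ((uint64_t)link_mbps * 1000000u);
}
```

With a 512-entry ring, a 10 Gbps link and a 2 GHz TSC this gives 68812
cycles, or roughly 34 microseconds. A caller would add the result to the
current TSC value to obtain the wakeup deadline.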

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_power/meson.build           |   3 +-
 lib/librte_power/rte_power.h           |  38 +++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
 lib/librte_power/rte_power_version.map |   4 +
 4 files changed, 228 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..44b01afce2 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
 headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
index bbbde4dfb4..06d5a9984f 100644
--- a/lib/librte_power/rte_power.h
+++ b/lib/librte_power/rte_power.h
@@ -14,6 +14,7 @@
 #include <rte_byteorder.h>
 #include <rte_log.h>
 #include <rte_string_fns.h>
+#include <rte_ethdev.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
  */
 int rte_power_exit(unsigned int lcore_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function mode.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 /**
  * Get the available frequencies of a specific lcore.
  * Function pointer definition. Review each environments
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..a445153ede
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+
+#include "rte_power.h"
+
+
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			int ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = (*dev->dev_ops->next_rx_desc)
+				(dev->data->rx_queues[qidx],
+				 &target_addr, &expected, &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr,
+						  expected, mask,
+						  0, -1ULL);
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (unlikely(nb_rx == 0)) {
+
+		dev->empty_poll_stats[qidx].num++;
+
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+				rte_pause();
+
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		dev->empty_poll_stats[qidx].num = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+			uint16_t port_id,
+			enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+	/* allocate memory for empty poll stats */
+	dev->empty_poll_stats = rte_malloc_socket(NULL,
+						  sizeof(struct rte_eth_ep_stat)
+						  * RTE_MAX_QUEUES_PER_PORT,
+						  0, dev->data->numa_node);
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	switch (mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+
+	dev->cb_mode = mode;
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_power_pmd_mgmt_disable(unsigned int lcore_id,
+				uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/*add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
+		return -EINVAL;
+
+	/* rte_free ignores NULL so safe to call without checks */
+	rte_free(dev->empty_poll_stats);
+
+	switch (dev->cb_mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		if (rte_power_exit(lcore_id))
+			return -EINVAL;
+		break;
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	dev->cur_pwr_cb = NULL;
+	dev->cb_mode = 0;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 00ee5753e2..ade83cfd4f 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.08
+	rte_power_pmd_mgmt_disable;
+	rte_power_pmd_mgmt_enable;
+
 };
-- 
2.17.1



* [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 5/6] net/i40e: " Liang Ma
                       ` (7 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e2..618fc15732 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..d1d015deae 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..826f451bee 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1



* [dpdk-dev] [PATCH v3 5/6] net/i40e: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (2 preceding siblings ...)
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 10:19     ` [dpdk-dev] [PATCH v3 6/6] net/ice: " Liang Ma
                       ` (6 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 11c02b1888..94e9298d7c 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -517,6 +517,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c1..9d7eea8aed 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..bfda5b6ad3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1



* [dpdk-dev] [PATCH v3 6/6] net/ice: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (3 preceding siblings ...)
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 5/6] net/i40e: " Liang Ma
@ 2020-09-04 10:19     ` Liang Ma
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
                       ` (5 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:19 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 8d435e8892..7d7e1dcbac 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 2e1f06d2c0..c043181ceb 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (4 preceding siblings ...)
  2020-09-04 10:19     ` [dpdk-dev] [PATCH v3 6/6] net/ice: " Liang Ma
@ 2020-09-04 16:23     ` Stephen Hemminger
  2020-09-14 20:48       ` Liang, Ma
  2020-09-04 16:37     ` Stephen Hemminger
                       ` (4 subsequent siblings)
  10 siblings, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:23 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);

Since this is generic code and you are defining the function, you
should have it return -ENOTSUP or -EINVAL.


* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-04 16:36       ` Stephen Hemminger
  2020-09-14 20:52         ` Liang, Ma
  2020-09-04 18:33       ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:36 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:57 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 port 1 queue use case as there is
> no coordination between queues/cores in ethdev. 1 port map to multiple
> core will be supported in next version.

The common way to express this is:

This API is not thread-safe and not preempt-safe.
There is also no mechanism for a single thread to wait on multiple queues.

> 
> This design leverage RX Callback mechnaism which allow three
> different power management methodology co exist.

nit: "coexist" is one word

> 
> 1. umwait/umonitor:
> 
>    The TSC timestamp is automatically calculated using current
>    link speed and RX descriptor ring size, such that the sleep
>    time is not longer than it would take for a NIC to fill its
>    entire RX descriptor ring.
> 
> 2. Pause instruction
> 
>    Instead of move the core into deeper C state, this lightweight
>    method use Pause instruction to releaf the processor from
>    busy polling.

Wording here is a problem, and "releaf" should be "relieve"?
Rewording into the active voice would be easier to read.

     Use the Pause instruction to relieve the processor from busy
     polling, without moving the core into a deeper C state.



> 
> 3. Frequency Scaling
>    Reuse exist rte power library to scale up/down core frequency
>    depend on traffic volume.



* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-04 16:37       ` Stephen Hemminger
  2020-09-14 21:04         ` Liang, Ma
  2020-09-04 20:54       ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:37 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:56 +0100
Liang Ma <liang.j.ma@intel.com> wrote:



> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */

Spelling here.

Also, shouldn't this be a per-device (or per-queue) configuration value?


* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (5 preceding siblings ...)
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
@ 2020-09-04 16:37     ` Stephen Hemminger
  2020-09-14 20:49       ` Liang, Ma
  2020-09-04 18:42     ` Stephen Hemminger
                       ` (3 subsequent siblings)
  10 siblings, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:37 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

This looks like a useful feature but needs more documentation and an example.
It would make sense to put an example in l3fwd-power.
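
As a starting point for such documentation, here is a minimal sketch of
the "optional comparison" described in the commit message. It assumes
the check is `(value & mask) == expected`, which matches how the drivers
in this series pass the DD bit as both mask and expected value. The
helper name `write_already_happened` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the masked value at addr already equals the
 * expected value, i.e. the awaited write has landed and there is
 * no point in going to sleep.
 */
static inline bool
write_already_happened(const volatile uint64_t *addr, uint64_t expected,
		uint64_t mask)
{
	return (*addr & mask) == expected;
}
```

A caller would run this check after arming the monitor and before
waiting, so that a descriptor written back in between is not missed.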



* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
  2020-09-04 16:36       ` Stephen Hemminger
@ 2020-09-04 18:33       ` Ananyev, Konstantin
  2020-09-14 21:01         ` Liang, Ma
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-04 18:33 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J

> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 port 1 queue use case as there is
> no coordination between queues/cores in ethdev. 1 port map to multiple
> core will be supported in next version.
> 
> This design leverage RX Callback mechnaism which allow three
> different power management methodology co exist.
> 
> 1. umwait/umonitor:
> 
>    The TSC timestamp is automatically calculated using current
>    link speed and RX descriptor ring size, such that the sleep
>    time is not longer than it would take for a NIC to fill its
>    entire RX descriptor ring.
> 
> 2. Pause instruction
> 
>    Instead of move the core into deeper C state, this lightweight
>    method use Pause instruction to releaf the processor from
>    busy polling.
> 
> 3. Frequency Scaling
>    Reuse exist rte power library to scale up/down core frequency
>    depend on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_power/meson.build           |   3 +-
>  lib/librte_power/rte_power.h           |  38 +++++
>  lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_version.map |   4 +
>  4 files changed, 228 insertions(+), 1 deletion(-)
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
> 
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 78c031c943..44b01afce2 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>  		'power_kvm_vm.c', 'guest_channel.c',
>  		'rte_power_empty_poll.c',
>  		'power_pstate_cpufreq.c',
> +		'rte_power_pmd_mgmt.c',
>  		'power_common.c')
>  headers = files('rte_power.h','rte_power_empty_poll.h')
> -deps += ['timer']
> +deps += ['timer' ,'ethdev']
> diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
> index bbbde4dfb4..06d5a9984f 100644
> --- a/lib/librte_power/rte_power.h
> +++ b/lib/librte_power/rte_power.h
> @@ -14,6 +14,7 @@
>  #include <rte_byteorder.h>
>  #include <rte_log.h>
>  #include <rte_string_fns.h>
> +#include <rte_ethdev.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
> @@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
>   */
>  int rte_power_exit(unsigned int lcore_id);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable device power management.
> + * @param lcore_id
> + *   lcore id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param mode
> + *   The power management callback function mode.
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> +				  uint16_t port_id,
> +				  enum rte_eth_dev_power_mgmt_cb_mode mode);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Disable device power management.
> + * @param lcore_id
> + *   lcore id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
> +
>  /**
>   * Get the available frequencies of a specific lcore.
>   * Function pointer definition. Review each environments
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> new file mode 100644
> index 0000000000..a445153ede
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> @@ -0,0 +1,184 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +#include <rte_ethdev.h>
> +
> +#include "rte_power.h"
> +
> +
> +
> +static uint16_t
> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		dev->empty_poll_stats[qidx].num++;

I believe there are two fundamental issues with that approach:
1. You put metadata specific to the power library's callbacks into the rte_eth_dev struct.
2. These callbacks access rte_eth_devices[] directly.
That doesn't look right to me - the rte_eth_dev structure is supposed to be treated
as internal to librte_ethdev and the underlying drivers, and should not be accessed
directly by outer code.
If these callbacks need some extra metadata, then it is the responsibility
of the power library to allocate and manage that metadata.
You can pass a pointer to this metadata via the last parameter of rte_eth_add_rx_callback().

> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +			volatile void *target_addr;
> +			uint64_t expected, mask;
> +			uint16_t ret;
> +
> +			/*
> +			 * get address of next descriptor in the RX
> +			 * ring for this queue, as well as expected
> +			 * value and a mask.
> +			 */
> +			ret = (*dev->dev_ops->next_rx_desc)
> +				(dev->data->rx_queues[qidx],
> +				 &target_addr, &expected, &mask);
> +			if (ret == 0)
> +				/* -1ULL is maximum value for TSC */
> +				rte_power_monitor(target_addr,
> +						  expected, mask,
> +						  0, -1ULL);
> +		}
> +	} else
> +		dev->empty_poll_stats[qidx].num = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	int i;
> +
> +	if (unlikely(nb_rx == 0)) {
> +
> +		dev->empty_poll_stats[qidx].num++;
> +
> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +
> +			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> +				rte_pause();
> +
> +		}
> +	} else
> +		dev->empty_poll_stats[qidx].num = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		dev->empty_poll_stats[qidx].num++;
> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +
> +			/*scale down freq */
> +			rte_power_freq_min(rte_lcore_id());
> +
> +		}
> +	} else {
> +		dev->empty_poll_stats[qidx].num = 0;
> +		/* scal up freq */
> +		rte_power_freq_max(rte_lcore_id());
> +	}
> +
> +	return nb_rx;
> +}
> +
> +int
> +rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> +			uint16_t port_id,
> +			enum rte_eth_dev_power_mgmt_cb_mode mode)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
> +		return -EINVAL;
> +	/* allocate memory for empty poll stats */
> +	dev->empty_poll_stats = rte_malloc_socket(NULL,
> +						  sizeof(struct rte_eth_ep_stat)
> +						  * RTE_MAX_QUEUES_PER_PORT,
> +						  0, dev->data->numa_node);
> +	if (dev->empty_poll_stats == NULL)
> +		return -ENOMEM;
> +
> +	switch (mode) {
> +	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> +		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> +			return -ENOTSUP;

Here and in other places: in case of an error return you don't free your empty_poll_stats.
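
One common way to avoid the leak is a single cleanup label that every
failing branch jumps to. The sketch below is standalone and uses
hypothetical names (`pmd_ctx`, `pmd_mgmt_enable_sketch`) with
`calloc`/`free` standing in for the rte_malloc family; it only
illustrates the control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical context standing in for the per-port power state. */
struct pmd_ctx {
	void *stats;
	bool enabled;
};

static int
pmd_mgmt_enable_sketch(struct pmd_ctx *ctx, size_t nb_queues,
		bool hw_supported)
{
	int ret;

	if (ctx->enabled)
		return -EINVAL;

	ctx->stats = calloc(nb_queues, sizeof(uint64_t));
	if (ctx->stats == NULL)
		return -ENOMEM;

	if (!hw_supported) {
		ret = -ENOTSUP;
		goto err_free; /* release what was allocated above */
	}

	ctx->enabled = true;
	return 0;

err_free:
	free(ctx->stats);
	ctx->stats = NULL;
	return ret;
}
```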

> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,

Why zero for the queue number? Why not pass queue_id as a parameter to that function?

> +						rte_power_mgmt_umwait, NULL);

As I said above, instead of NULL this could be a pointer to a metadata struct.

> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> +		/* init scale freq */
> +		if (rte_power_init(lcore_id))
> +			return -EINVAL;
> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> +					rte_power_mgmt_scalefreq, NULL);
> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> +						rte_power_mgmt_pause, NULL);
> +		break;
> +	}
> +
> +	dev->cb_mode = mode;
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> +	return 0;
> +}
> +
> +int
> +rte_power_pmd_mgmt_disable(unsigned int lcore_id,
> +				uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/*add flag check */
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
> +		return -EINVAL;
> +
> +	/* rte_free ignores NULL so safe to call without checks */
> +	rte_free(dev->empty_poll_stats);

You can't free callback metadata before removing the callback itself.
In fact, with the current RX callback code it is not safe to free it
even afterwards (we discussed it offline).

> +
> +	switch (dev->cb_mode) {
> +	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> +	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, 0,
> +					   dev->cur_pwr_cb);
> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> +		rte_power_freq_max(lcore_id);

Stupid q: what makes you think that the lcore frequency was at max
*before* you set up the callback?
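
A sketch of the alternative: remember the frequency index at enable time
and restore it on disable. Everything here is a mock under stated
assumptions: `mock_freq_idx` stands in for whatever the power library
reports per lcore, and the `*_sketch` functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 4

/* Mock per-lcore frequency index, in place of the real library state. */
static uint32_t mock_freq_idx[SKETCH_MAX_LCORE];
/* Index that was active when power management was enabled. */
static uint32_t saved_freq_idx[SKETCH_MAX_LCORE];

static void
scale_enable_sketch(unsigned int lcore_id)
{
	assert(lcore_id < SKETCH_MAX_LCORE);
	saved_freq_idx[lcore_id] = mock_freq_idx[lcore_id];
}

static void
scale_to_min_sketch(unsigned int lcore_id, uint32_t min_idx)
{
	assert(lcore_id < SKETCH_MAX_LCORE);
	mock_freq_idx[lcore_id] = min_idx;
}

static void
scale_disable_sketch(unsigned int lcore_id)
{
	assert(lcore_id < SKETCH_MAX_LCORE);
	/* Restore what was there before, rather than assuming max. */
	mock_freq_idx[lcore_id] = saved_freq_idx[lcore_id];
}
```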

> +		rte_eth_remove_rx_callback(port_id, 0,
> +					   dev->cur_pwr_cb);
> +		if (rte_power_exit(lcore_id))
> +			return -EINVAL;
> +		break;
> +	}
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> +	dev->cur_pwr_cb = NULL;
> +	dev->cb_mode = 0;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index 00ee5753e2..ade83cfd4f 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -34,4 +34,8 @@ EXPERIMENTAL {
>  	rte_power_guest_channel_receive_msg;
>  	rte_power_poll_stat_fetch;
>  	rte_power_poll_stat_update;
> +	# added in 20.08
> +	rte_power_pmd_mgmt_disable;
> +	rte_power_pmd_mgmt_enable;
> +
>  };
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (6 preceding siblings ...)
  2020-09-04 16:37     ` Stephen Hemminger
@ 2020-09-04 18:42     ` Stephen Hemminger
  2020-09-14 21:12       ` Liang, Ma
  2020-09-16 16:34       ` Liang, Ma
  2020-09-06 21:44     ` Ananyev, Konstantin
                       ` (2 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 18:42 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

Before this is merged, please work with Arm maintainers to have a version that
works on Arm 64 as well. Don't think this should be merged unless the two major
platforms supported by DPDK can work with it.

Also, not sure if this mechanism can work with other drivers. You need to
work with other vendors to show that the same infrastructure can work with
their hardware. Once again, I don't think this can go in if it only can
work on Intel.  It needs to work on Broadcom, Mellanox to be useful.

Will it work in a VM? Will it work with virtio or vmxnet3?

Having a single vendor solution is a non-starter for me.
They don't all have to be there to get it merged, but if the design only
works on single platform then it is not helpful.


* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-04 16:37       ` Stephen Hemminger
@ 2020-09-04 20:54       ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-04 20:54 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J

> Add a simple API allow ethdev get the last
> available queue descriptor address from PMD.
> Also include internal structure update.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
>  lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
>  2 files changed, 66 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 70295d7ab7..d9312d3e11 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -157,6 +157,7 @@ extern "C" {
>  #include <rte_common.h>
>  #include <rte_config.h>
>  #include <rte_ether.h>
> +#include <rte_power_intrinsics.h>
> 
>  #include "rte_ethdev_trace_fp.h"
>  #include "rte_dev_info.h"
> @@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
>  /** Maximum nb. of vlan per mirror rule */
>  #define ETH_MIRROR_MAX_VLANS       64
> 
> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
>  #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
>  #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
>  #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
> @@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
>  	RTE_ETH_DEV_REMOVED,
>  };
> 
> +#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum rte_eth_dev_power_mgmt_state {
> +	/** Device power management is disabled. */
> +	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	RTE_ETH_DEV_POWER_MGMT_ENABLED,
> +};
> +
> +enum rte_eth_dev_power_mgmt_cb_mode {
> +	/** WAIT callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
> +	/** PAUSE callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
> +	/** Freq Scaling callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
> +};
> +

I don't think we need to put all this power-related
stuff into the rte_ethdev library.
rte_power or similar seems like a much better place for it.

>  struct rte_eth_dev_sriov {
>  	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
>  	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
> diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
> index 32407dd418..16e54bb4e4 100644
> --- a/lib/librte_ethdev/rte_ethdev_core.h
> +++ b/lib/librte_ethdev/rte_ethdev_core.h
> @@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the next RX ring descriptor address.
> + *
> + * @param rxq
> + *   ethdev queue pointer.
> + * @param tail_desc_addr
> + *   the pointer point to descriptor address var.
> + * @param expected
> + *   the pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   the pointer point to comparison bitmask for the expected value.
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */
> +typedef int (*eth_next_rx_desc_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +

In theory it could be anything: the next RXD, a doorbell,
or even some global variable.
So I think the function name needs to be more neutral:
eth_rx_wait_addr() or so.
Also, I think you need a new rte_eth_ wrapper function
for that dev op.

>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -752,6 +776,8 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_next_rx_desc_t next_rx_desc;
> +	/**< Get next RX ring descriptor address. */
>  };
> 
>  /**
> @@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
>  	void *param;
>  };
> 
> +/**
> + * @internal
> + * Structure used to hold counters for empty poll
> + */
> +struct rte_eth_ep_stat {
> +	uint64_t num;
> +} __rte_cache_aligned;
> +
>  /**
>   * @internal
>   * The generic data structure associated with each ethernet device.
> @@ -807,8 +841,16 @@ struct rte_eth_dev {
>  	enum rte_eth_dev_state state; /**< Flag indicating the port state */
>  	void *security_ctx; /**< Context for security ops */
> 
> -	uint64_t reserved_64s[4]; /**< Reserved for future fields */
> -	void *reserved_ptrs[4];   /**< Reserved for future fields */
> +	/**< Empty poll number */
> +	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
> +	/**< Power mgmt Callback mode */
> +	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
> +	uint64_t reserved_64s[3]; /**< Reserved for future fields */
> +
> +	/**< Flag indicating the port power state */
> +	struct rte_eth_ep_stat *empty_poll_stats;
> +	const struct rte_eth_rxtx_callback *cur_pwr_cb;
> +	void *reserved_ptrs[2];   /**< Reserved for future fields */
>  } __rte_cache_aligned;
> 
>  struct rte_eth_dev_sriov;
> --
> 2.17.1
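
To illustrate the two suggestions above (a neutral name for the op and a
public wrapper around it), a minimal sketch might look like the following.
Every name here (sketch_eth_dev, rx_wait_addr_t, sketch_eth_rx_wait_addr_get)
is a hypothetical stand-in and not part of the ethdev API; the device
structure is reduced to the bare minimum so the example is self-contained.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PORTS 4

/*
 * Hypothetical driver op: report an address worth waiting on (next RXD,
 * doorbell, or anything else), plus the expected value and comparison mask.
 */
typedef int (*rx_wait_addr_t)(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask);

/* Hypothetical, heavily reduced stand-in for a device structure. */
struct sketch_eth_dev {
	rx_wait_addr_t rx_wait_addr; /* NULL if the driver lacks support */
	void *rx_queues[2];
	uint16_t nb_rx_queues;
};

static struct sketch_eth_dev sketch_devices[SKETCH_MAX_PORTS];

/*
 * Wrapper: validates its arguments and dispatches to the driver op, so
 * that outside code never has to index the device table itself.
 */
static int
sketch_eth_rx_wait_addr_get(uint16_t port_id, uint16_t queue_id,
		volatile void **addr, uint64_t *expected, uint64_t *mask)
{
	struct sketch_eth_dev *dev;

	if (port_id >= SKETCH_MAX_PORTS || addr == NULL ||
			expected == NULL || mask == NULL)
		return -EINVAL;

	dev = &sketch_devices[port_id];
	if (queue_id >= dev->nb_rx_queues)
		return -EINVAL;
	if (dev->rx_wait_addr == NULL)
		return -ENOTSUP; /* driver does not implement the op */

	return dev->rx_wait_addr(dev->rx_queues[queue_id],
			addr, expected, mask);
}
```

With a wrapper of this shape, a caller outside ethdev can distinguish
"bad arguments" (-EINVAL) from "driver has no such op" (-ENOTSUP) without
looking inside the device structure.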



* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (7 preceding siblings ...)
  2020-09-04 18:42     ` Stephen Hemminger
@ 2020-09-06 21:44     ` Ananyev, Konstantin
  2020-09-18  5:01     ` Jerin Jacob
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
  10 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-06 21:44 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J


> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..6dd1cdc939
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_

As a nit - if the functions below are supported for both the 64-bit and 32-bit ISA,
then the guard should probably be: RTE_POWER_INTRINSIC_X86_H_


> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
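
The optional expected-value check and the TSC split described in the quoted
header can be sketched in portable C, without the raw opcodes. This is only
an illustration of the documented logic; the helper names are hypothetical
and do not appear in the patch.

```c
#include <stdint.h>

/*
 * Return 1 if the monitored location already holds the expected value
 * under the mask, i.e. the caller should skip entering the wait state.
 * A zero mask disables the check, as the quoted documentation describes.
 */
static inline int
sketch_wait_already_satisfied(const volatile uint64_t *p,
		uint64_t expected_value, uint64_t value_mask)
{
	if (value_mask == 0)
		return 0;
	return (*p & value_mask) == expected_value;
}

/*
 * Split a 64-bit TSC deadline into the low and high 32-bit halves that
 * the quoted code passes to the instruction in two separate registers.
 */
static inline void
sketch_split_tsc(uint64_t tsc, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)tsc;
	*hi = (uint32_t)(tsc >> 32);
}
```

Note that a match is only possible when expected_value has no bits set
outside value_mask, since the current value is masked before comparison.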


* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
@ 2020-09-14 20:48       ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 20:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

Hi Stephen,
   Agreed. v4 will address this.
Regards
Liang
On 04 Sep 09:23, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:55 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> > + *
> > + * @return
> > + *   Architecture-dependent return value.
> > + */
> > +static inline int rte_power_monitor(const volatile void *p,
> > +		const uint64_t expected_value, const uint64_t value_mask,
> > +		const uint32_t state, const uint64_t tsc_timestamp);
> 
> Since this is generic code, and you are defining the function.
> You should have it return -ENOTSUPPORTED or -EINVAL.
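
A generic fallback of the kind suggested above could be sketched as follows.
The function name is hypothetical, and the sketch assumes the standard POSIX
errno ENOTSUP is what is meant (there is no standard userspace errno named
ENOTSUPPORTED).

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical generic fallback for architectures without a monitor/wait
 * instruction: it does nothing and reports that the operation is not
 * supported, so callers can fall back to plain polling.
 */
static inline int
sketch_power_monitor_generic(const volatile void *p,
		uint64_t expected_value, uint64_t value_mask,
		uint32_t state, uint64_t tsc_timestamp)
{
	(void)p;
	(void)expected_value;
	(void)value_mask;
	(void)state;
	(void)tsc_timestamp;

	return -ENOTSUP;
}
```

A negative errno keeps the generic return value clearly distinct from the
0/1 wakeup reasons returned by the x86 implementation.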


* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 16:37     ` Stephen Hemminger
@ 2020-09-14 20:49       ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 20:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

Hi Stephen, 
   The v4 patch will include the l3fwd-power update.
Regards
Liang

On 04 Sep 09:37, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:55 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> > Add two new power management intrinsics, and provide an implementation
> > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > are implemented as raw byte opcodes because there is not yet widespread
> > compiler support for these instructions.
> > 
> > The power management instructions provide an architecture-specific
> > function to either wait until a specified TSC timestamp is reached, or
> > optionally wait until either a TSC timestamp is reached or a memory
> > location is written to. The monitor function also provides an optional
> > comparison, to avoid sleeping when the expected write has already
> > happened, and no more writes are expected.
> > 
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> This looks like a useful feature but needs more documentation and example.
> It would make sense to put an example in l3fwd-power. 
>


* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 16:36       ` Stephen Hemminger
@ 2020-09-14 20:52         ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 20:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

<snip>
Hi Stephen, 
    v4 will support one port with multiple cores (still one queue per core).
    This part of the description will be updated to reflect the design change.
Regards
Liang
> The common way to express is this is:
> 
> This API is not thread-safe and not preempt-safe.
> There is also no mechanism for a single thread to wait on multiple queues.
> 
> > 
> > This design leverage RX Callback mechnaism which allow three
> > different power management methodology co exist.
> 
> nit coexist is one word
> 
> > 
> > 1. umwait/umonitor:
> > 
> >    The TSC timestamp is automatically calculated using current
> >    link speed and RX descriptor ring size, such that the sleep
> >    time is not longer than it would take for a NIC to fill its
> >    entire RX descriptor ring.
> > 
> > 2. Pause instruction
> > 
> >    Instead of move the core into deeper C state, this lightweight
> >    method use Pause instruction to releaf the processor from
> >    busy polling.
> 
> Wording here is a problem, and "releaf" should be "relief"?
> Rewording into active voice grammar would be easier.
> 
>      Use Pause instruction to allow processor to go into deeper C
>      state when busy polling.
> 
> 
> 
> > 
> > 3. Frequency Scaling
> >    Reuse exist rte power library to scale up/down core frequency
> >    depend on traffic volume.
> 


* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 18:33       ` Ananyev, Konstantin
@ 2020-09-14 21:01         ` Liang, Ma
  2020-09-16 14:53           ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 21:01 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Hunt, David, Burakov, Anatoly

On 04 Sep 11:33, Ananyev, Konstantin wrote:

<snip>
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +dev->empty_poll_stats[qidx].num++;
>
Hi Konstantin,
   Agreed, v4 will relocate the metadata to a separate structure,
   without touching the rte_ethdev structure.
> I believe there are two fundamental issues with that approach:
> 1. You put metadata specific to the (power) lib callbacks into the rte_eth_dev struct.
> 2. These callbacks access rte_eth_devices[] directly.
> That doesn't look right to me - the rte_eth_dev structure is supposed to be treated
> as internal to librte_ethdev and the underlying drivers, and should not be accessed
> directly by outer code.
> If these callbacks need some extra metadata, then it is the responsibility
> of the power library to allocate/manage this metadata.
> You can pass a pointer to this metadata via the last parameter of rte_eth_add_rx_callback().
> 
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +volatile void *target_addr;
> > +uint64_t expected, mask;
> > +uint16_t ret;
> > +
> > +/*
> > + * get address of next descriptor in the RX
> > + * ring for this queue, as well as expected
> > + * value and a mask.
> > + */
> > +ret = (*dev->dev_ops->next_rx_desc)
> > +(dev->data->rx_queues[qidx],
> > + &target_addr, &expected, &mask);
> > +if (ret == 0)
> > +/* -1ULL is maximum value for TSC */
> > +rte_power_monitor(target_addr,
> > +  expected, mask,
> > +  0, -1ULL);
> > +}
> > +} else
> > +dev->empty_poll_stats[qidx].num = 0;
> > +
> > +return nb_rx;
> > +}
> > +
> > +static uint16_t
> > +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> > +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> > +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> > +{
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +int i;
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +
> > +dev->empty_poll_stats[qidx].num++;
> > +
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +
> > +for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> > +rte_pause();
> > +
> > +}
> > +} else
> > +dev->empty_poll_stats[qidx].num = 0;
> > +
> > +return nb_rx;
> > +}
> > +
> > +static uint16_t
> > +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> > +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> > +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> > +{
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +dev->empty_poll_stats[qidx].num++;
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +
> > +/*scale down freq */
> > +rte_power_freq_min(rte_lcore_id());
> > +
> > +}
> > +} else {
> > +dev->empty_poll_stats[qidx].num = 0;
> > +/* scal up freq */
> > +rte_power_freq_max(rte_lcore_id());
> > +}
> > +
> > +return nb_rx;
> > +}
> > +
> > +int
> > +rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> > +uint16_t port_id,
> > +enum rte_eth_dev_power_mgmt_cb_mode mode)
> > +{
> > +struct rte_eth_dev *dev;
> > +
> > +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> > +dev = &rte_eth_devices[port_id];
> > +
> > +if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
> > +return -EINVAL;
> > +/* allocate memory for empty poll stats */
> > +dev->empty_poll_stats = rte_malloc_socket(NULL,
> > +  sizeof(struct rte_eth_ep_stat)
> > +  * RTE_MAX_QUEUES_PER_PORT,
> > +  0, dev->data->numa_node);
> > +if (dev->empty_poll_stats == NULL)
> > +return -ENOMEM;
> > +
> > +switch (mode) {
> > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > +if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> > +return -ENOTSUP;
> 
> Here and in other places: in case of error return you don't free your empty_poll_stats.
> 
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> 
> Why zero for queue number, why not to pass queue_id as a parameter for that function?
v4 will use queue_id instead of 0. v3 still assumes that only queue 0 is used.
> 
> > +rte_power_mgmt_umwait, NULL);
> 
> As I said above, instead of NULL - could be pointer to metadata struct.
v4 will address this. 
> 
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > +/* init scale freq */
> > +if (rte_power_init(lcore_id))
> > +return -EINVAL;
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> > +rte_power_mgmt_scalefreq, NULL);
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> > +rte_power_mgmt_pause, NULL);
> > +break;
> > +}
> > +
> > +dev->cb_mode = mode;
> > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> > +return 0;
> > +}
> > +
> > +int
> > +rte_power_pmd_mgmt_disable(unsigned int lcore_id,
> > +uint16_t port_id)
> > +{
> > +struct rte_eth_dev *dev;
> > +
> > +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> > +dev = &rte_eth_devices[port_id];
> > +
> > +/*add flag check */
> > +
> > +if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
> > +return -EINVAL;
> > +
> > +/* rte_free ignores NULL so safe to call without checks */
> > +rte_free(dev->empty_poll_stats);
> 
> You can't free callback metadata before removing the callback itself.
> In fact, with current rx callback code it is not safe to free it
> even after (we discussed it offline).
agree. 
> 
> > +
> > +switch (dev->cb_mode) {
> > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > +rte_eth_remove_rx_callback(port_id, 0,
> > +   dev->cur_pwr_cb);
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > +rte_power_freq_max(lcore_id);
> 
> Stupid q: what makes you think that lcore frequency was max,
> *before* you setup the callback?
That is because rte_power_init() has already figured out the system max.
The init code already invokes rte_power_init().
> 
> > +rte_eth_remove_rx_callback(port_id, 0,
> > +   dev->cur_pwr_cb);
> > +if (rte_power_exit(lcore_id))
> > +return -EINVAL;
> > +break;
> > +}
> > +
> > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > +dev->cur_pwr_cb = NULL;
> > +dev->cb_mode = 0;
> > +
> > +return 0;
> > +}
> > diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> > index 00ee5753e2..ade83cfd4f 100644
> > --- a/lib/librte_power/rte_power_version.map
> > +++ b/lib/librte_power/rte_power_version.map
> > @@ -34,4 +34,8 @@ EXPERIMENTAL {
> >  rte_power_guest_channel_receive_msg;
> >  rte_power_poll_stat_fetch;
> >  rte_power_poll_stat_update;
> > +# added in 20.08
> > +rte_power_pmd_mgmt_disable;
> > +rte_power_pmd_mgmt_enable;
> > +
> >  };
> > --
> > 2.17.1
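
The review point about keeping the empty-poll metadata in the power library
and handing it to the callback through its last parameter can be sketched
as below. This is a self-contained illustration only: the structure and
function names are hypothetical, the packet array is typed as void ** rather
than the real mbuf type, and the wait itself is replaced by a counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state owned by the power library, not by ethdev. */
struct sketch_queue_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
	uint64_t threshold;   /* per-queue limit instead of a global define */
	uint64_t sleeps;      /* how often the wait state would be entered */
};

/*
 * Shaped like an RX callback: the last argument carries the metadata, so
 * the callback never needs to touch the device structure.
 */
static uint16_t
sketch_rx_cb(uint16_t port_id, uint16_t queue_id, void **pkts,
		uint16_t nb_rx, uint16_t max_pkts, void *user_param)
{
	struct sketch_queue_state *st = user_param;

	(void)port_id;
	(void)queue_id;
	(void)pkts;
	(void)max_pkts;

	if (nb_rx == 0) {
		/* same order as the quoted code: count first, then compare */
		if (++st->empty_polls > st->threshold)
			st->sleeps++; /* a real callback would wait here */
	} else {
		st->empty_polls = 0; /* traffic seen: start counting again */
	}

	return nb_rx;
}
```

Because the state is owned by whoever registers the callback, that owner
also controls its lifetime: it can remove the callback first and release the
state only once no thread can still be executing the callback.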
> 


* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 16:37       ` Stephen Hemminger
@ 2020-09-14 21:04         ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 21:04 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

Agreed, this will be addressed.
On 04 Sep 09:37, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:56 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> 
> 
> > +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
> 
> Spelling here.
> 
> Also, shouldn't this be a per-device (or per-queue) configuration value.


* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 18:42     ` Stephen Hemminger
@ 2020-09-14 21:12       ` Liang, Ma
  2020-09-16 16:34       ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 21:12 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

On 04 Sep 11:42, Stephen Hemminger wrote:
<snip>
We are very open to discussing the design with other vendors.
> Before this is merged, please work with Arm maintainers to have a version that
> works on Arm 64 as well. Don't think this should be merged unless the two major
> platforms supported by DPDK can work with it. 

> Also, not sure if this mechanism can work with other drivers. You need to
> work with other vendors to show that the same infrastructure can work with
> their hardware. Once again, I don't think this can go in if it only can
> work on Intel.  It needs to work on Broadcom, Mellanox to be useful.
This mechanism should work with any device that uses a HW ring descriptor mechanism.
I think most Mellanox and Broadcom NICs can support it easily.

> Will it work in a VM? Will it work with virtio or vmxnet3?
> 
Generally speaking, it is not easy for a guest OS to use this.
However, virtio is under investigation.
> Having a single vendor solution is a non-starter for me.
> They don't all have to be there to get it merged, but if the design only
> works on single platform then it is not helpful.


* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-14 21:01         ` Liang, Ma
@ 2020-09-16 14:53           ` Ananyev, Konstantin
  2020-09-16 16:39             ` Liang, Ma
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-16 14:53 UTC (permalink / raw)
  To: Ma, Liang J; +Cc: dev, Hunt, David, Burakov, Anatoly



> >
> > > +
> > > +switch (dev->cb_mode) {
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > > +rte_eth_remove_rx_callback(port_id, 0,
> > > +   dev->cur_pwr_cb);
> > > +break;
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > > +rte_power_freq_max(lcore_id);
> >
> > Stupid q: what makes you think that lcore frequency was max,
> > *before* you setup the callback?
> that is because the rte_power_init() has figured out the system max.
> the init code invocate rte_power_init() already.

So rte_power_init(lcore) always raises lcore frequency to
max possible value?

> >
> > > +rte_eth_remove_rx_callback(port_id, 0,
> > > +   dev->cur_pwr_cb);
> > > +if (rte_power_exit(lcore_id))
> > > +return -EINVAL;
> > > +break;
> > > +}
> > > +
> > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > +dev->cur_pwr_cb = NULL;
> > > +dev->cb_mode = 0;
> > > +
> > > +return 0;
> > > +}


* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 18:42     ` Stephen Hemminger
  2020-09-14 21:12       ` Liang, Ma
@ 2020-09-16 16:34       ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-16 16:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

On 04 Sep 11:42, Stephen Hemminger wrote:
<snip> 

We have discussed this with Arm developers in the past.
Please refer to https://patches.dpdk.org/patch/70662/
There was no objection, in my opinion.
Also, the API we proposed has the experimental tag, so other vendors can still change it.

The ethdev internal op we introduced should work with any NIC that uses a ring descriptor
writeback mechanism. But we lack insight into the internals of Mellanox or Broadcom NICs.

AF_XDP PMD and virtio-net are under investigation.

I hope the above explanation addresses your concern.

> Before this is merged, please work with Arm maintainers to have a version that
> works on Arm 64 as well. Don't think this should be merged unless the two major
> platforms supported by DPDK can work with it.
> 
> Also, not sure if this mechanism can work with other drivers. You need to
> work with other vendors to show that the same infrastructure can work with
> their hardware. Once again, I don't think this can go in if it only can
> work on Intel.  It needs to work on Broadcom, Mellanox to be useful.
> 
> Will it work in a VM? Will it work with virtio or vmxnet3?
> 
> Having a single vendor solution is a non-starter for me.
> They don't all have to be there to get it merged, but if the design only
> works on single platform then it is not helpful.


* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-16 14:53           ` Ananyev, Konstantin
@ 2020-09-16 16:39             ` Liang, Ma
  2020-09-16 16:44               ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-09-16 16:39 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Hunt, David, Burakov, Anatoly

On 16 Sep 07:53, Ananyev, Konstantin wrote:
<snip>
Yes. We only have two gears: min or max. However, the user can still customize
their system max with the power mgmt python script on Intel platforms.
> So rte_power_init(lcore) always raises lcore frequency to
> max possible value?
> 
> > >
> > > > +rte_eth_remove_rx_callback(port_id, 0,
> > > > +   dev->cur_pwr_cb);
> > > > +if (rte_power_exit(lcore_id))
> > > > +return -EINVAL;
> > > > +break;
> > > > +}
> > > > +
> > > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > > +dev->cur_pwr_cb = NULL;
> > > > +dev->cb_mode = 0;
> > > > +
> > > > +return 0;
> > > > +}


* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-16 16:39             ` Liang, Ma
@ 2020-09-16 16:44               ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-16 16:44 UTC (permalink / raw)
  To: Ma, Liang J; +Cc: dev, Hunt, David, Burakov, Anatoly

> On 16 Sep 07:53, Ananyev, Konstantin wrote:
> <snip>
> Yes. we only has two gear. min or max. However, user still can customize
> their system max with power mgmt python script on Intel platform.

Ok, thanks for explanation.

> > So rte_power_init(lcore) always raises lcore frequency to
> > max possible value?
> >
> > > >
> > > > > +rte_eth_remove_rx_callback(port_id, 0,
> > > > > +   dev->cur_pwr_cb);
> > > > > +if (rte_power_exit(lcore_id))
> > > > > +return -EINVAL;
> > > > > +break;
> > > > > +}
> > > > > +
> > > > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > > > +dev->cur_pwr_cb = NULL;
> > > > > +dev->cb_mode = 0;
> > > > > +
> > > > > +return 0;
> > > > > +}


* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (8 preceding siblings ...)
  2020-09-06 21:44     ` Ananyev, Konstantin
@ 2020-09-18  5:01     ` Jerin Jacob
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
  10 siblings, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-09-18  5:01 UTC (permalink / raw)
  To: Liang Ma, Honnappa Nagarahalli, Stephen Hemminger
  Cc: dpdk-dev, David Hunt, Anatoly Burakov, Richardson, Bruce,
	Ananyev, Konstantin, Thomas Monjalon

On Fri, Sep 4, 2020 at 3:49 PM Liang Ma <liang.j.ma@intel.com> wrote:
>
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
>
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>


> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
[snip]
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint32_t state, const uint64_t tsc_timestamp)

IMO, we must introduce some arch feature-capability _get_ scheme to tell
the consumer that this API is only supported on x86. Probably as functions[1]
or a macro flags scheme, with a stub for the other architectures, as the
API is marked as generic, i.e. rte_power_* not rte_x86_*.

This will help the consumer to create workers based on the instruction features
which can NOT be abstracted as a generic feature across the architectures.


[1]
struct rte_arch_inst_feat {
        uint32_t power_monitor      : 1;  /**< Power monitor */
...
}

void rte_arch_inst_feat_get(struct rte_arch_inst_feat *feat);
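
A minimal sketch of how a consumer might use such a capability query,
assuming the hypothetical rte_arch_inst_feat_get() from [1] existed. Nothing
like it is part of this patchset; the stub body, enum worker_kind and
pick_worker() are invented here purely for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical capability struct, following the proposal in [1]. */
struct rte_arch_inst_feat {
	uint32_t power_monitor : 1; /* rte_power_monitor() is usable */
};

/* Stub for an architecture without monitor/wait support: report nothing. */
static void
rte_arch_inst_feat_get(struct rte_arch_inst_feat *feat)
{
	memset(feat, 0, sizeof(*feat));
}

enum worker_kind { WORKER_BUSY_POLL, WORKER_MONITOR };

/* Choose a worker loop from the reported features (illustrative helper). */
static enum worker_kind
pick_worker(const struct rte_arch_inst_feat *feat)
{
	return feat->power_monitor ? WORKER_MONITOR : WORKER_BUSY_POLL;
}
```

With the stub above, an application would fall back to the busy-poll worker;
an x86 implementation would presumably set the bit when WAITPKG is available.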


* [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (9 preceding siblings ...)
  2020-09-18  5:01     ` Jerin Jacob
@ 2020-10-02 14:11     ` Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
                         ` (20 more replies)
  10 siblings, 21 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Add new x86 cpuid support for WAITPKG.
This flag indicates that the processor supports the
UMWAIT/UMONITOR/TPAUSE instructions.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/* UMONITOR/UMWAIT/TPAUSE Instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1
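
For readers who want to see what the FEAT_DEF entry above encodes (CPUID
leaf 7, subleaf 0, ECX bit 5), here is a small standalone sketch that does
the same check outside DPDK. It assumes a GCC- or Clang-style <cpuid.h>;
waitpkg_from_ecx() and cpu_has_waitpkg() are illustrative names, not DPDK
APIs:

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define WAITPKG_ECX_BIT 5 /* CPUID.(EAX=07H,ECX=0):ECX[5], as in FEAT_DEF */

/* Extract the WAITPKG bit from an ECX value returned by CPUID leaf 7. */
static int
waitpkg_from_ecx(uint32_t ecx)
{
	return (ecx >> WAITPKG_ECX_BIT) & 1;
}

/* Return 1 if the running CPU reports WAITPKG, 0 otherwise. */
static int
cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 if leaf 7 is not supported */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return waitpkg_from_ecx(ecx);
#else
	return 0; /* not an x86 CPU */
#endif
}
```

Inside DPDK the equivalent check is
rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG), which is what patch 04/10 of
this series uses.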



* [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-08  8:33         ` Thomas Monjalon
  2020-10-08 17:15         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
                         ` (19 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

For more details, please refer to Intel SDM Volume 2.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 4 files changed, 209 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
-- 
2.17.1
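
The register setup and the pre-sleep check in rte_power_monitor() above can
be illustrated without executing UMONITOR/UMWAIT. The following is a
portable sketch only; split_tsc(), should_skip_sleep() and timed_out() are
names made up for this example and do not exist in the patch:

```c
#include <stdint.h>

/* UMWAIT/TPAUSE take the TSC deadline in EDX:EAX, so split it in two. */
static void
split_tsc(uint64_t tsc, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)tsc;
	*hi = (uint32_t)(tsc >> 32);
}

/*
 * Mirror of the optional check: a zero mask disables the comparison,
 * otherwise the sleep is skipped when the masked value already matches.
 */
static int
should_skip_sleep(const volatile uint64_t *p, uint64_t expected, uint64_t mask)
{
	if (mask == 0)
		return 0;
	return (*p & mask) == expected;
}

/* The carry flag is bit 0 of RFLAGS; the patch returns it as "timeout". */
static int
timed_out(uint64_t rflags)
{
	return (int)(rflags & 0x1);
}
```

Note that the comparison happens after the address has been armed with
UMONITOR, so a write that lands between the check and UMWAIT should still
wake the core.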



* [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-08  8:46         ` Thomas Monjalon
  2020-10-08 22:26         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback Liang Ma
                         ` (18 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Add a simple API to allow ethdev to get the wake-up address from the PMD.
Also include an internal structure update.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.c           | 19 ++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_version.map |  1 +
 4 files changed, 72 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index d7668114ca..88253d95f9 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -4804,6 +4804,25 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		      volatile void **wake_addr,
+		      uint64_t *expected, uint64_t *mask)
+{
+	struct rte_eth_dev *dev;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	ret = (*dev->dev_ops->get_wake_addr)
+				(dev->data->rx_queues[queue_id],
+				 wake_addr, expected, mask);
+
+	return ret;
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d2bf74f128..a6cfe3cd57 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4014,6 +4014,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake-up address for a specific queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param wake_addr
+ *   Pointer to the variable that receives the address to monitor.
+ * @param expected
+ *   Pointer to the value to expect when the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ *
+ * @return
+ *   - 0: Success.
+ *   - -EINVAL: Failed to get wake address.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+			  volatile void **wake_addr,
+			  uint64_t *expected, uint64_t *mask);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c3062c246c..935d46f25c 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the wake-up address.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to the variable that receives the descriptor address.
+ * @param expected
+ *   Pointer to the value to expect when the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_get_wake_addr_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -713,6 +738,9 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get wake up address. */
+
 };
 
 /**
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index c95ef5157a..3cb2093980 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -229,6 +229,7 @@ EXPERIMENTAL {
 	# added in 20.11
 	rte_eth_link_speed_to_str;
 	rte_eth_link_to_str;
+	rte_eth_get_wake_addr;
 };
 
 INTERNAL {
-- 
2.17.1
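
To make the driver side of this API more concrete, below is a sketch of what
a PMD's get_wake_addr callback could look like. The descriptor layout, the
"done" bit and all demo_* names are assumptions made for illustration; they
do not correspond to any real NIC or to the ixgbe/i40e patches later in this
series:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor and queue layout, for illustration only. */
struct demo_rx_desc {
	uint64_t addr;
	uint64_t status; /* bit 0: "descriptor done", written by the NIC */
};

struct demo_rx_queue {
	volatile struct demo_rx_desc *ring;
	uint16_t next_to_use; /* index of the next descriptor to be filled */
};

#define DEMO_DD_BIT UINT64_C(0x1)

/* Matches the eth_get_wake_addr_t prototype from this patch. */
static int
demo_get_wake_addr(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask)
{
	struct demo_rx_queue *q = rxq;

	if (q == NULL || q->ring == NULL)
		return -EINVAL;

	/* watch the status word of the descriptor the NIC writes next */
	*tail_desc_addr = &q->ring[q->next_to_use].status;
	/* the packet has already arrived once the done bit reads as set */
	*expected = DEMO_DD_BIT;
	*mask = DEMO_DD_BIT;
	return 0;
}
```

The expected/mask pair lets the caller skip the sleep when the descriptor
was completed between the last empty poll and the arming of the monitor.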



* [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-09 16:38         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API Liang Ma
                         ` (17 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API supports the one-port-to-multiple-cores use case.

This design leverages the RX callback mechanism, which allows three
different power management methodologies to coexist.

1. umwait/umonitor:

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. Pause instruction

   Instead of moving the core into a deeper C-state, this lightweight
   method uses the PAUSE instruction to relieve the processor from
   busy polling.

3. Frequency Scaling

   Reuse the existing rte_power library to scale core frequency up
   or down depending on traffic volume.
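
As a rough illustration of the sizing rule in item 1: the sketch below
estimates how many TSC cycles it takes a NIC to fill a ring at line rate.
It is not the code in this patch (the umwait callback below passes -1ULL as
the deadline); it assumes worst-case minimum-size Ethernet frames, and
ring_fill_cycles() is a made-up helper name:

```c
#include <stdint.h>

/* Wire size of a minimum Ethernet frame: 64 B + 8 B preamble/SFD + 12 B IFG. */
#define MIN_WIRE_BYTES 84

/*
 * Upper bound, in TSC cycles, on how long a core may sleep before a
 * ring of `nb_desc` descriptors could fill at line rate.
 */
static uint64_t
ring_fill_cycles(uint32_t link_mbps, uint16_t nb_desc, uint64_t tsc_hz)
{
	/* fastest possible packet rate on this link, in packets per second */
	uint64_t pps = (uint64_t)link_mbps * 1000000 / (MIN_WIRE_BYTES * 8);

	if (pps == 0)
		return 0;
	return tsc_hz * nb_desc / pps;
}
```

For a 10G link, a 512-entry ring and a 2 GHz TSC this gives 68812 cycles,
i.e. roughly 34 microseconds of sleep budget.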

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/pmd_mgmt.h            |  49 ++++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 208 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++++
 lib/librte_power/rte_power_version.map |   4 +
 5 files changed, 352 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/pmd_mgmt.h
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
new file mode 100644
index 0000000000..756fbe20f7
--- /dev/null
+++ b/lib/librte_power/pmd_mgmt.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _PMD_MGMT_H
+#define _PMD_MGMT_H
+
+/**
+ * @file
+ * Power Management
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/** Power mgmt Callback mode */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/** Empty poll number */
+	uint16_t empty_poll_stats;
+	/** Callback instance  */
+	const struct rte_eth_rxtx_callback *cur_cb;
+} __rte_cache_aligned;
+
+struct pmd_port_cfg {
+	int  ref_cnt;
+	struct pmd_queue_cfg *queue_cfg;
+} __rte_cache_aligned;
+
+
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..35d2af46a4
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,208 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+#include "pmd_mgmt.h"
+
+
+#define EMPTYPOLL_MAX  512
+#define PAUSE_NUM  64
+
+static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			int ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = rte_eth_get_wake_addr(port_id, qidx,
+						    &target_addr, &expected,
+						    &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr,
+						  expected, mask,
+						  0, -1ULL);
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+	int i;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			for (i = 0; i < PAUSE_NUM; i++)
+				rte_pause();
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret = 0;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (port_cfg[port_id].queue_cfg == NULL) {
+		port_cfg[port_id].ref_cnt = 0;
+		/* allocate memory for empty poll stats */
+		port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
+					sizeof(struct pmd_queue_cfg)
+					* RTE_MAX_QUEUES_PER_PORT,
+					0, dev->data->numa_node);
+		if (port_cfg[port_id].queue_cfg == NULL)
+			return -ENOMEM;
+	}
+
+	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto failure_handler;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		/* init scale freq */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto failure_handler;
+		}
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+	queue_cfg->cb_mode = mode;
+	port_cfg[port_id].ref_cnt++;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	return ret;
+
+failure_handler:
+	if (port_cfg[port_id].ref_cnt == 0) {
+		rte_free(port_cfg[port_id].queue_cfg);
+		port_cfg[port_id].queue_cfg = NULL;
+	}
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	if (port_cfg[port_id].ref_cnt <= 0)
+		return -EINVAL;
+
+	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
+		return -EINVAL;
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+					   queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+					   queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/* It is not recommended to free the callback instance here.
+	 * This causes a memory leak, which is a known issue.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	port_cfg[port_id].ref_cnt--;
+
+	if (port_cfg[port_id].ref_cnt == 0) {
+		rte_free(port_cfg[port_id].queue_cfg);
+		port_cfg[port_id].queue_cfg = NULL;
+	}
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..8b110f1148
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ * @param lcore_id
+ *   ID of the lcore the power management applies to.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ * @param lcore_id
+ *   ID of the lcore the power management applies to.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1
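
The empty-poll counting shared by the three callbacks above can be tried out
in isolation. The sketch below replaces the real power-saving action with a
counter so it runs without DPDK; the demo_* names are invented, and the
saturating increment is an addition of this sketch (the patch increments the
16-bit counter unconditionally):

```c
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold as in the patch */

struct demo_queue_state {
	uint16_t empty_polls; /* consecutive polls that returned no packets */
	uint64_t sleeps;      /* how often the power-saving action ran */
};

/* Stand-in for rte_power_monitor()/rte_pause()/rte_power_freq_min(). */
static void
demo_save_power(struct demo_queue_state *q)
{
	q->sleeps++;
}

/* Called after every RX burst with the number of packets received. */
static uint16_t
demo_rx_callback(struct demo_queue_state *q, uint16_t nb_rx)
{
	if (nb_rx == 0) {
		/* saturate so the 16-bit counter cannot wrap back to zero */
		if (q->empty_polls < UINT16_MAX)
			q->empty_polls++;
		if (q->empty_polls > EMPTYPOLL_MAX)
			demo_save_power(q);
	} else {
		q->empty_polls = 0;
	}
	return nb_rx;
}
```

The first 512 empty polls do nothing; every empty poll after that triggers
the power-saving action until a non-empty burst resets the counter.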



* [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (2 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-09 15:53         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 06/10] net/i40e: " Liang Ma
                         ` (16 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
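
As an illustration of how the (address, expected, mask) triple returned by
`get_wake_addr` could be consumed, here is a minimal, self-contained sketch.
It is not code from this series: `struct fake_desc`, `fake_get_wake_addr()`
and `should_sleep()` are hypothetical stand-ins, the DD bit is assumed to be
bit 0, and byte-order conversion is left out for brevity.

```c
#include <stdint.h>

/* Hypothetical stand-in for a NIC write-back descriptor. */
struct fake_desc {
	volatile uint32_t status_error;
};

/* Assumed position of the "descriptor done" (DD) bit. */
#define FAKE_STAT_DD UINT32_C(0x1)

/*
 * Same shape as the driver callback: report which address to watch and
 * which masked value means "this descriptor was already written".
 */
static int
fake_get_wake_addr(struct fake_desc *ring, uint16_t tail,
		volatile void **addr, uint64_t *expected, uint64_t *mask)
{
	*addr = &ring[tail].status_error;
	*expected = FAKE_STAT_DD;
	*mask = FAKE_STAT_DD;
	return 0;
}

/*
 * Return 1 if it is safe to go to sleep, 0 if the expected write has
 * already happened and sleeping would only add latency.
 */
static int
should_sleep(volatile void *addr, uint64_t expected, uint64_t mask)
{
	uint64_t cur = *(volatile uint32_t *)addr;

	return (cur & mask) != expected;
}
```

The comparison is done before sleeping so that a descriptor completed
between the last poll and the arming of the monitor is not missed.
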
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 0b98e210e7..30b3f416d4 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..7a9fd2aec6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..75020fa2fc 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 06/10] net/i40e: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (3 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-09 16:01         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 07/10] net/ice: " Liang Ma
                         ` (15 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 943cfe71dc..cab86f8ec9 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr	              = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 322fc1ed75..c17f27292f 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..f23a2073e3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 07/10] net/ice: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (4 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 06/10] net/i40e: " Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 08/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
                         ` (14 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d8ce09d28f..260de5dfd7 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr	              = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 93a0ac6918..9e55eca942 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -25,6 +25,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..c729e474c9 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		      uint64_t *expected, uint64_t *mask);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 08/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (5 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 07/10] net/ice: " Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 09/10] doc: update release notes for PMD power management Liang Ma
                         ` (13 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma

Add PMD power management support to the l3fwd-power example application,
selected with a new --pmd-mgmt command line option.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
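
A possible way to call the enable/disable pair with return-value checks is
sketched below. This is only an illustration under stated assumptions: the
two `rte_power_pmd_mgmt_queue_*` functions are local stubs that copy the
prototypes proposed in this series so that the sketch compiles without DPDK,
the stub's failure rule is invented for demonstration, and `struct rxq_map`
and `enable_all()` are hypothetical helpers.

```c
#include <stdint.h>
#include <stdio.h>

/* Copied from the header proposed in this series. */
enum rte_power_pmd_mgmt_type {
	RTE_POWER_MGMT_TYPE_WAIT = 1,
	RTE_POWER_MGMT_TYPE_PAUSE,
	RTE_POWER_MGMT_TYPE_SCALE,
};

/* Stub with the proposed prototype; the failure rule is made up. */
static int
rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
	(void)lcore_id;
	(void)port_id;
	(void)mode;
	return queue_id < 4 ? 0 : -1;
}

/* Stub with the proposed prototype; always succeeds. */
static int
rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, uint16_t port_id,
		uint16_t queue_id)
{
	(void)lcore_id;
	(void)port_id;
	(void)queue_id;
	return 0;
}

/* Hypothetical lcore/port/queue mapping entry. */
struct rxq_map {
	unsigned int lcore_id;
	uint16_t port_id;
	uint16_t queue_id;
};

/*
 * Enable power management on every mapped queue. On the first failure,
 * disable the queues that were already enabled and return the error.
 */
static int
enable_all(const struct rxq_map *q, unsigned int n,
		enum rte_power_pmd_mgmt_type mode)
{
	unsigned int i;
	int ret;

	for (i = 0; i < n; i++) {
		ret = rte_power_pmd_mgmt_queue_enable(q[i].lcore_id,
				q[i].port_id, q[i].queue_id, mode);
		if (ret < 0) {
			fprintf(stderr, "enable failed on port %u queue %u\n",
				(unsigned int)q[i].port_id,
				(unsigned int)q[i].queue_id);
			while (i-- > 0)
				rte_power_pmd_mgmt_queue_disable(q[i].lcore_id,
						q[i].port_id, q[i].queue_id);
			return ret;
		}
	}
	return 0;
}
```

Rolling back on failure keeps all queues of a port on the same scheme, or
on none at all.
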
 examples/l3fwd-power/main.c | 44 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index d0e6c9bd77..b1b139129a 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,8 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
+
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +201,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1753,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1775,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1886,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf("PMD power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2452,9 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
+
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2723,12 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				rte_power_pmd_mgmt_queue_enable(lcore_id,
+							portid, queueid,
+						RTE_POWER_MGMT_TYPE_SCALE);
+
+			}
 		}
 	}
 
@@ -2790,8 +2814,12 @@ main(int argc, char **argv)
 						SKIP_MASTER);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MASTER);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MASTER);
 	}
 
+
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
 		launch_timer(rte_lcore_id());
 
@@ -2812,6 +2840,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 09/10] doc: update release notes for PMD power management
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (6 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 08/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 10/10] doc: update the programming guide " Liang Ma
                         ` (12 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma

Add release notes for PMD power management

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/rel_notes/release_20_11.rst | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index c2175f37f3..57ac73722a 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added an EXPERIMENTAL ethdev API for PMD power management.**
+
+  * Added ``rte_eth_get_wake_addr()``.
+  * Added a new eth_dev_ops callback, ``get_wake_addr``.
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
@@ -107,6 +112,17 @@ New Features
   * Extern objects and functions can be plugged into the pipeline.
   * Transaction-oriented table updates.
 
+* **Added PMD power management mechanisms.**
+
+  Three new PMD power management mechanisms are added through the existing
+  RX_ETH_CALLBACK infrastructure.
+
+  * Added the umwait power saving scheme.
+  * Added the pause power saving scheme.
+  * Added the frequency scaling power saving scheme.
+  * Added new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``.
+  * Added new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``.
+
 
 Removed Items
 -------------
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 10/10] doc: update the programming guide for PMD power management
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (7 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 09/10] doc: update release notes for PMD power management Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-02 14:44       ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Bruce Richardson
                         ` (11 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma

Update the programming guide and the l3fwd-power sample application guide
for PMD power management.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
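
The trigger condition described in the programming guide (empty poll count
above a threshold, reset on traffic) can be sketched as below. This is an
assumption-laden illustration, not the rte_power implementation: the
threshold value, the `pm_*` names and the action enum are all hypothetical.

```c
#include <stdint.h>

/* Assumed threshold; the real value is chosen by the library. */
#define EMPTYPOLL_MAX 512

/* Hypothetical names for the three schemes. */
enum pm_scheme { PM_WAIT = 1, PM_PAUSE, PM_SCALE };

/* What the callback would do next; hypothetical. */
enum pm_action {
	PM_ACT_NONE,
	PM_ACT_MONITOR_WAIT,	/* arm the monitor and wait */
	PM_ACT_PAUSE,		/* issue a pause instruction */
	PM_ACT_SCALE_DOWN,	/* lower the core frequency */
};

struct pm_queue_state {
	enum pm_scheme scheme;
	uint32_t empty_polls;
};

/*
 * Called after each RX burst with the number of packets received.
 * Resets the counter when packets arrive; once more than EMPTYPOLL_MAX
 * consecutive bursts were empty, returns the action for the scheme.
 */
static enum pm_action
pm_on_rx_burst(struct pm_queue_state *st, uint16_t nb_rx)
{
	if (nb_rx > 0) {
		st->empty_polls = 0;
		return PM_ACT_NONE;
	}
	/* Saturate just above the threshold to avoid overflow. */
	if (st->empty_polls <= EMPTYPOLL_MAX)
		st->empty_polls++;
	if (st->empty_polls <= EMPTYPOLL_MAX)
		return PM_ACT_NONE;

	switch (st->scheme) {
	case PM_WAIT:
		return PM_ACT_MONITOR_WAIT;
	case PM_PAUSE:
		return PM_ACT_PAUSE;
	case PM_SCALE:
		return PM_ACT_SCALE_DOWN;
	}
	return PM_ACT_NONE;
}
```

Keeping the decision separate from the action makes the same counter
logic usable for all three schemes.
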
 doc/guides/prog_guide/power_man.rst           | 40 +++++++++++++++++++
 .../sample_app_ug/l3_forward_power_man.rst    | 15 ++++++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..c95b948874 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -188,6 +188,43 @@ API Overview for Empty Poll Power Management
 
 * **Detect empty poll state change**: empty poll state change detection algorithm then take action.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+The existing power management mechanisms require developers to change the application design or code to use them.
+To avoid this, the PMD power management API is designed to be transparent to the application.
+It leverages the RX_CALLBACK mechanism, which allows three different power management
+schemes to coexist. A scheme is triggered when the number of empty polls exceeds a defined threshold.
+
+  * umwait/umonitor
+
+    The umonitor instruction arms monitoring of the wake address, and umwait then puts the processor into
+    a power-optimized sub-state. Once the content of the address changes, the processor wakes up from the
+    sub-state. A timeout is set up as well, so that if no wake event happens, the processor still wakes
+    up after the timeout expires.
+
+  * Pause instruction
+
+    Instead of moving the core into a deeper C-state, this lightweight method uses the Pause instruction
+    to relieve the processor from busy polling.
+
+  * Frequency Scaling
+
+    Reuses the existing rte_power library to scale the core frequency up or down
+    depending on traffic volume.
+
+The mechanism supports multiple ports, and each port can map to multiple cores. However, one core can only
+map to one queue (regardless of which port). In theory, queues belonging to the same port can use different
+power schemes, but it is strongly recommended to use the same power scheme for all queues of a port.
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable a specific power scheme for a given queue/port/core.
+
+* **Queue Disable**: Disable the power scheme for a given queue/port/core.
+
 User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
@@ -200,3 +237,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 0cc6f2e62e..82f9ac849c 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -107,7 +107,9 @@ where,
 
 *   --empty-poll: Traffic Aware power management. See below for details
 
-*   --telemetry:  Telemetry mode.
+*   --telemetry: Telemetry mode.
+
+*   --pmd-mgmt: PMD power management mode.
 
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
@@ -459,3 +461,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD Power Management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables the power saving scheme on the specified
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power --pmd-mgmt -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)"
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (8 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 10/10] doc: update the programming guide " Liang Ma
@ 2020-10-02 14:44       ` Bruce Richardson
  2020-10-08 22:08       ` Ananyev, Konstantin
                         ` (10 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Bruce Richardson @ 2020-10-02 14:44 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, stephen, konstantin.ananyev, Anatoly Burakov

On Fri, Oct 02, 2020 at 03:11:50PM +0100, Liang Ma wrote:
> Add new x86 cpuid support for WAITPKG.
> This flag indicate processor support umwait/umonitor/tpause
> instruction.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
>  lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..5041a830a7 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
>  	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
>  
> +	/**< UMWAIT/TPAUSE Instructions */
> +	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */
Typo: UMINITOR

>  	/* The last item */
>  	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
>  };
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 30439e7951..0325c4b93b 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
>  	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>  	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
>  
> +	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
> +
>  	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>  	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
>  
> -- 
> 2.17.1
> 
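
For reference, the feature bit in the FEAT_DEF line above (leaf 7, subleaf 0,
ECX bit 5) can also be checked outside DPDK. The sketch below assumes a GCC
or Clang toolchain that provides `__get_cpuid_count()` in `<cpuid.h>`;
`cpu_has_waitpkg()` is a hypothetical helper name, and on non-x86 builds it
simply reports the feature as absent.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Return 1 if CPUID.(EAX=7,ECX=0):ECX bit 5 (WAITPKG) is set, which
 * indicates UMONITOR/UMWAIT/TPAUSE support; return 0 otherwise.
 */
static int
cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 if the leaf is not supported. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return (ecx >> 5) & 1;
#else
	return 0;
#endif
}
```
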

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
@ 2020-10-08  8:33         ` Thomas Monjalon
  2020-10-08  8:44           ` Jerin Jacob
  2020-10-08 17:15         ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-08  8:33 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, david.hunt, stephen, konstantin.ananyev, Anatoly Burakov,
	Liang Ma, honnappa.nagarahalli, ruifeng.wang, David Christensen,
	jerinj

> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> For more details, Please reference Intel SDM Volume 2.

I would really like to see feedback from other arch maintainers.
Unfortunately they were not Cc'ed.

Also please mark the new functions as experimental.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08  8:33         ` Thomas Monjalon
@ 2020-10-08  8:44           ` Jerin Jacob
  2020-10-08  9:41             ` Thomas Monjalon
  2020-10-08 13:26             ` Burakov, Anatoly
  0 siblings, 2 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-08  8:44 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Liang Ma, dpdk-dev, David Hunt, Stephen Hemminger, Ananyev,
	Konstantin, Anatoly Burakov, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> > Add two new power management intrinsics, and provide an implementation
> > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > are implemented as raw byte opcodes because there is not yet widespread
> > compiler support for these instructions.
> >
> > The power management instructions provide an architecture-specific
> > function to either wait until a specified TSC timestamp is reached, or
> > optionally wait until either a TSC timestamp is reached or a memory
> > location is written to. The monitor function also provides an optional
> > comparison, to avoid sleeping when the expected write has already
> > happened, and no more writes are expected.
> >
> > For more details, Please reference Intel SDM Volume 2.
>
> I really would like to see feedbacks from other arch maintainers.
> Unfortunately they were not Cc'ed.

Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
http://mails.dpdk.org/archives/dev/2020-September/181646.html

> Also please mark the new functions as experimental.
>
>

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
@ 2020-10-08  8:46         ` Thomas Monjalon
  2020-10-08 11:39           ` Ananyev, Konstantin
  2020-10-08 22:26         ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-08  8:46 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, david.hunt, stephen, konstantin.ananyev, Anatoly Burakov,
	Liang Ma, ferruh.yigit, arybchenko, honnappa.nagarahalli,
	ruifeng.wang, jerinj, David Christensen

> +/**
> + * Retrieve the wake up address from specific queue
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Rx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param wake_addr
> + *   The pointer point to the address which is used for monitoring.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + *
> + * @return
> + *   - 0: Success.
> + *   -EINVAL: Failed to get wake address.
> + */
> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +                         volatile void **wake_addr,
> +                         uint64_t *expected, uint64_t *mask);

It looks to be a very low-level API.
Can't we do something more "ready-to-use" at ethdev level?

Cc'ing the relevant maintainers...
Note: sorry this comment comes late, but the ethdev maintainers were not Cc'ed.
Reminder: having no feedback is not a good sign; you should request comments.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08  8:44           ` Jerin Jacob
@ 2020-10-08  9:41             ` Thomas Monjalon
  2020-10-08 13:26             ` Burakov, Anatoly
  1 sibling, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-08  9:41 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, David Hunt, Stephen Hemminger, Ananyev, Konstantin,
	Anatoly Burakov, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

08/10/2020 10:44, Jerin Jacob:
> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >
> > > Add two new power management intrinsics, and provide an implementation
> > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > are implemented as raw byte opcodes because there is not yet widespread
> > > compiler support for these instructions.
> > >
> > > The power management instructions provide an architecture-specific
> > > function to either wait until a specified TSC timestamp is reached, or
> > > optionally wait until either a TSC timestamp is reached or a memory
> > > location is written to. The monitor function also provides an optional
> > > comparison, to avoid sleeping when the expected write has already
> > > happened, and no more writes are expected.
> > >
> > > For more details, Please reference Intel SDM Volume 2.
> >
> > I really would like to see feedbacks from other arch maintainers.
> > Unfortunately they were not Cc'ed.
> 
> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> http://mails.dpdk.org/archives/dev/2020-September/181646.html

This comment was sent on September 18.
Later this v4 was sent without replying to the comments.
This is blocking the series.
I am considering this feature as low priority.

> > Also please mark the new functions as experimental.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-08  8:46         ` Thomas Monjalon
@ 2020-10-08 11:39           ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 11:39 UTC (permalink / raw)
  To: Thomas Monjalon, Ma, Liang J
  Cc: dev, Hunt, David, stephen, Burakov, Anatoly, Ma, Liang J, Yigit,
	Ferruh, arybchenko, honnappa.nagarahalli, ruifeng.wang, jerinj,
	David Christensen



> > +/**
> > + * Retrieve the wake up address from specific queue
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The Rx queue on the Ethernet device for which information
> > + *   will be retrieved.
> > + * @param wake_addr
> > + *   The pointer point to the address which is used for monitoring.
> > + * @param expected
> > + *   The pointer point to value to be expected when descriptor is set.
> > + * @param mask
> > + *   The pointer point to comparison bitmask for the expected value.
> > + *
> > + * @return
> > + *   - 0: Success.
> > + *   -EINVAL: Failed to get wake address.
> > + */
> > +__rte_experimental
> > +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> > +                         volatile void **wake_addr,
> > +                         uint64_t *expected, uint64_t *mask);
> 
> It looks to be a very low-level API.
> Can't we do something more "ready-to-use" at ethdev level?

I think the series provides both:
There is a low-level API in ethdev/eal to retrieve the address to wait on,
plus the actual function that puts the core to sleep.
On top of that there is a high-level API in the rte_power lib:
rte_power_pmd_mgmt_queue_enable()/rte_power_pmd_mgmt_queue_disable(),
which uses the low-level calls and wraps some high-level logic around them.
From my perspective it is a good design choice,
as it keeps all of the power-related burden inside the rte_power library and
gives the user a lot of flexibility in terms of API usage.
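
To illustrate what "high-level logic" can mean here, below is a minimal
sketch of the empty-poll counting described in the cover letter. It is not
the code from the series: the demo_* names and the threshold parameter are
made up for illustration, and the actual wake-address lookup and sleep call
are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue state kept by the high-level layer. */
struct demo_queue_state {
	uint32_t empty_polls; /* consecutive RX bursts that returned nothing */
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns true once `threshold` consecutive empty polls have been seen,
 * i.e. when the caller may fetch the wake address and go to sleep.
 */
static bool
demo_should_sleep(struct demo_queue_state *st, uint16_t nb_rx,
		uint32_t threshold)
{
	if (nb_rx != 0) {
		/* traffic seen: reset the counter and keep polling */
		st->empty_polls = 0;
		return false;
	}
	if (st->empty_polls < threshold)
		st->empty_polls++;
	return st->empty_polls >= threshold;
}
```

The counter is reset on any non-empty burst, so a queue only becomes a
sleep candidate after an uninterrupted run of empty polls.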
Konstantin 

> 
> Cc'in the relevant maintainers...
> Note: sorry this comment come late but ethdev maintainers were not Cc.
> Reminder: having no feedback is not a good sign, you should request comments.
> 



* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08  8:44           ` Jerin Jacob
  2020-10-08  9:41             ` Thomas Monjalon
@ 2020-10-08 13:26             ` Burakov, Anatoly
  2020-10-08 15:13               ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-08 13:26 UTC (permalink / raw)
  To: Jerin Jacob, Thomas Monjalon
  Cc: Liang Ma, dpdk-dev, David Hunt, Stephen Hemminger, Ananyev,
	Konstantin, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>
>>> Add two new power management intrinsics, and provide an implementation
>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>> are implemented as raw byte opcodes because there is not yet widespread
>>> compiler support for these instructions.
>>>
>>> The power management instructions provide an architecture-specific
>>> function to either wait until a specified TSC timestamp is reached, or
>>> optionally wait until either a TSC timestamp is reached or a memory
>>> location is written to. The monitor function also provides an optional
>>> comparison, to avoid sleeping when the expected write has already
>>> happened, and no more writes are expected.
>>>
>>> For more details, Please reference Intel SDM Volume 2.
>>
>> I really would like to see feedbacks from other arch maintainers.
>> Unfortunately they were not Cc'ed.
> 
> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> 
>> Also please mark the new functions as experimental.
>>
>>

Hi Jerin,

 > IMO, We must introduce some arch feature-capability _get_ scheme to tell
 > the consumer of this API is only supported on x86. Probably as functions[1]
 > or macro flags scheme and have a stub for the other architectures as the
 > API marked as generic ie rte_power_* not rte_x86_..
 >
 > This will help the consumer to create workers based on the instruction features
 > which can NOT be abstracted as a generic feature across the architectures.

I'm not entirely sure what you mean by that.

I mean, yes, we should have added stubs for other architectures, and we 
will add those in future revisions, but what does your proposed runtime 
check accomplish that cannot currently be done with CPUID flags?

If you look at patch 1 [1], we added CPUID flags that the user can 
check, and in fact this is precisely what we do in patch 4 [2] before 
enabling the UMWAIT path. We could perhaps document this better and 
outline the dependency on the WAITPKG CPUID flag more explicitly, but 
otherwise I don't see how what you're proposing isn't already possible 
to do.

[1] http://patches.dpdk.org/patch/79539/
[2] http://patches.dpdk.org/patch/79540/ , function 
rte_power_pmd_mgmt_queue_enable()
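
For reference, a rough sketch of what the WAITPKG check boils down to.
Inside DPDK the check would presumably go through
rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG); the standalone helper below
(demo_cpu_has_waitpkg() is a hypothetical name, not part of the series)
reads the same bit that patch 1 adds to the feature table, CPUID leaf 7,
subleaf 0, ECX bit 5, using the GCC/clang <cpuid.h> helper. On non-x86
builds it simply reports "not available".

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* WAITPKG (UMONITOR/UMWAIT/TPAUSE): CPUID leaf 7, subleaf 0, ECX bit 5. */
static bool
demo_cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 when the leaf is not supported */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return (ecx >> 5) & 1;
#else
	/* no WAITPKG outside x86 */
	return false;
#endif
}
```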

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08 13:26             ` Burakov, Anatoly
@ 2020-10-08 15:13               ` Jerin Jacob
  2020-10-08 17:07                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-08 15:13 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, Liang Ma, dpdk-dev, David Hunt,
	Stephen Hemminger, Ananyev, Konstantin, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>
> >>> Add two new power management intrinsics, and provide an implementation
> >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>> are implemented as raw byte opcodes because there is not yet widespread
> >>> compiler support for these instructions.
> >>>
> >>> The power management instructions provide an architecture-specific
> >>> function to either wait until a specified TSC timestamp is reached, or
> >>> optionally wait until either a TSC timestamp is reached or a memory
> >>> location is written to. The monitor function also provides an optional
> >>> comparison, to avoid sleeping when the expected write has already
> >>> happened, and no more writes are expected.
> >>>
> >>> For more details, Please reference Intel SDM Volume 2.
> >>
> >> I really would like to see feedbacks from other arch maintainers.
> >> Unfortunately they were not Cc'ed.
> >
> > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >
> >> Also please mark the new functions as experimental.
> >>
> >>
>
> Hi Jerin,

Hi Anatoly,

>
>  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>  > the consumer of this API is only supported on x86. Probably as functions[1]
>  > or macro flags scheme and have a stub for the other architectures as the
>  > API marked as generic ie rte_power_* not rte_x86_..
>  >
>  > This will help the consumer to create workers based on the instruction features
>  > which can NOT be abstracted as a generic feature across the architectures.
>
> I'm not entirely sure what you mean by that.
>
> I mean, yes, we should have added stubs for other architectures, and we
> will add those in future revisions, but what does your proposed runtime
> check accomplish that cannot currently be done with CPUID flags?


The RTE_CPUFLAG_WAITPKG flag definition is not available on other architectures,
i.e. RTE_CPUFLAG_WAITPKG is defined in lib/librte_eal/x86/include/rte_cpuflags.h
but it is used in http://patches.dpdk.org/patch/79540/ in a generic API.
I doubt http://patches.dpdk.org/patch/79540/ would compile on non-x86.
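
To make the suggestion more concrete, here is a minimal sketch of the kind
of capability _get_ scheme I have in mind. All names are hypothetical
(struct demo_intrinsics_caps, demo_get_intrinsics_caps(), the two toy
workers); nothing like this exists in the series today. The point is that
the generic caller never names an x86-only flag, and other architectures
get an all-false answer instead of a build failure.

```c
#include <stdbool.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical arch-neutral capability report. */
struct demo_intrinsics_caps {
	bool power_monitor; /* monitor-and-wait style sleep is usable */
	bool power_pause;   /* timed pause is usable */
};

static void
demo_get_intrinsics_caps(struct demo_intrinsics_caps *caps)
{
	memset(caps, 0, sizeof(*caps));
#if defined(__x86_64__) || defined(__i386__)
	{
		unsigned int eax, ebx, ecx, edx;

		/* WAITPKG: CPUID leaf 7, subleaf 0, ECX bit 5 */
		if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) &&
				((ecx >> 5) & 1)) {
			caps->power_monitor = true;
			caps->power_pause = true;
		}
	}
#endif
	/* other architectures: everything stays false (stub behaviour) */
}

/* Toy workers, only there to show selection based on the report. */
typedef int (*demo_worker_t)(void);
static int demo_worker_busy_poll(void) { return 0; }
static int demo_worker_monitor(void) { return 1; }

static demo_worker_t
demo_pick_worker(const struct demo_intrinsics_caps *caps)
{
	return caps->power_monitor ? demo_worker_monitor
				   : demo_worker_busy_poll;
}
```

An application would call demo_get_intrinsics_caps() once at startup and
launch whichever worker demo_pick_worker() returns.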



>
> If you look at patch 1 [1], we added CPUID flags that the user can
> check, and in fact this is precisely what we do in patch 4 [2] before
> enabling the UMWAIT path. We could perhaps document this better and
> outline the dependency on the WAITPKG CPUID flag more explicitly, but
> otherwise i don't see how what you're proposing isn't already possible
> to do.
>
> [1] http://patches.dpdk.org/patch/79539/
> [2] http://patches.dpdk.org/patch/79540/ , function
> rte_power_pmd_mgmt_queue_enable()
>
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08 15:13               ` Jerin Jacob
@ 2020-10-08 17:07                 ` Ananyev, Konstantin
  2020-10-09  5:42                   ` Jerin Jacob
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 17:07 UTC (permalink / raw)
  To: Jerin Jacob, Burakov, Anatoly
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

> 
> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
> >
> > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > >>
> > >>> Add two new power management intrinsics, and provide an implementation
> > >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > >>> are implemented as raw byte opcodes because there is not yet widespread
> > >>> compiler support for these instructions.
> > >>>
> > >>> The power management instructions provide an architecture-specific
> > >>> function to either wait until a specified TSC timestamp is reached, or
> > >>> optionally wait until either a TSC timestamp is reached or a memory
> > >>> location is written to. The monitor function also provides an optional
> > >>> comparison, to avoid sleeping when the expected write has already
> > >>> happened, and no more writes are expected.
> > >>>
> > >>> For more details, Please reference Intel SDM Volume 2.
> > >>
> > >> I really would like to see feedbacks from other arch maintainers.
> > >> Unfortunately they were not Cc'ed.
> > >
> > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > >
> > >> Also please mark the new functions as experimental.
> > >>
> > >>
> >
> > Hi Jerin,
> 
> Hi Anatoly,
> 
> >
> >  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >  > the consumer of this API is only supported on x86. Probably as functions[1]
> >  > or macro flags scheme and have a stub for the other architectures as the
> >  > API marked as generic ie rte_power_* not rte_x86_..
> >  >
> >  > This will help the consumer to create workers based on the instruction features
> >  > which can NOT be abstracted as a generic feature across the architectures.
> >
> > I'm not entirely sure what you mean by that.
> >
> > I mean, yes, we should have added stubs for other architectures, and we
> > will add those in future revisions, but what does your proposed runtime
> > check accomplish that cannot currently be done with CPUID flags?
> 
> 
> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.


I agree with Jerin that we need some generic way to
figure out whether the platform supports power_monitor() or not.
Though I am not sure we need to create a new feature-get framework here...
Might something like:
 rte_power_monitor(...) == -ENOTSUP
be enough of an indication for that?
So the user can just do:
if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
	/* not supported  path */
}

to check whether the feature is supported or not.
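
A minimal sketch of how such a probe could behave, under the assumption
that a supported platform rejects the NULL address with a different error
before touching any hardware. demo_power_monitor() and the
demo_monitor_available flag are stand-ins made up for this mail, not the
functions from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for "this arch/CPU implements monitor-and-wait". */
static int demo_monitor_available;

static int
demo_power_monitor(const volatile void *p, uint64_t expected_value,
		uint64_t value_mask, uint32_t state, uint64_t tsc_timestamp)
{
	(void)expected_value;
	(void)value_mask;
	(void)state;
	(void)tsc_timestamp;

	/* stub behaviour for architectures without the feature */
	if (!demo_monitor_available)
		return -ENOTSUP;
	/* argument check comes first, so probing with NULL never sleeps */
	if (p == NULL)
		return -EINVAL;
	/* a real implementation would arm the monitor and wait here */
	return 0;
}

/* Probe helper: anything other than -ENOTSUP means "supported". */
static int
demo_power_monitor_supported(void)
{
	return demo_power_monitor(NULL, 0, 0, 0, 0) != -ENOTSUP;
}
```

The probe relies on the ordering of the two checks, so that ordering would
have to be part of the documented contract.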

> >
> > If you look at patch 1 [1], we added CPUID flags that the user can
> > check, and in fact this is precisely what we do in patch 4 [2] before
> > enabling the UMWAIT path. We could perhaps document this better and
> > outline the dependency on the WAITPKG CPUID flag more explicitly, but
> > otherwise i don't see how what you're proposing isn't already possible
> > to do.
> >
> > [1] http://patches.dpdk.org/patch/79539/
> > [2] http://patches.dpdk.org/patch/79540/ , function
> > rte_power_pmd_mgmt_queue_enable()
> >
> > --
> > Thanks,
> > Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
  2020-10-08  8:33         ` Thomas Monjalon
@ 2020-10-08 17:15         ` Ananyev, Konstantin
  2020-10-09  9:11           ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 17:15 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.

I think what this API is missing is a function to wake up a sleeping core.
If the user can/should use some system call to achieve that, then at least
it has to be clearly documented; even better, some wrapper should be provided.
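
Just to show the kind of wrapper I mean, a rough sketch follows. It assumes
that the sleeping core monitors a word that other cores are allowed to
write, and that a store to the monitored cache line ends the wait. The
demo_* names are invented here and are not part of the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical per-core wake-up word. In a real layout it would sit
 * alone in its own 64-byte cache line.
 */
struct demo_wake_slot {
	_Atomic uint64_t seq;
};

/* Waker side: the store itself is what is expected to end the wait. */
static void
demo_wakeup(struct demo_wake_slot *slot)
{
	atomic_fetch_add_explicit(&slot->seq, 1, memory_order_release);
}

/*
 * Sleeper side: returns non-zero if a wake-up arrived since the value
 * `seen` was read, so the core can skip (or leave) the sleep.
 */
static int
demo_wake_pending(struct demo_wake_slot *slot, uint64_t seen)
{
	return atomic_load_explicit(&slot->seq, memory_order_acquire) != seen;
}
```

The sequence counter avoids losing a wake-up that lands between the last
check and the start of the wait: the sleeper compares against the value it
saw before arming the monitor.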

> 
> For more details, Please reference Intel SDM Volume 2.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
>  4 files changed, 209 insertions(+)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> 
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..cd7f8070ac
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file define APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index cd09027958..3a12e87e19 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -60,6 +60,7 @@ generic_headers = files(
>  	'generic/rte_memcpy.h',
>  	'generic/rte_pause.h',
>  	'generic/rte_prefetch.h',
> +	'generic/rte_power_intrinsics.h',
>  	'generic/rte_rwlock.h',
>  	'generic/rte_spinlock.h',
>  	'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_prefetch.h',
> +	'rte_power_intrinsics.h',
>  	'rte_pause.h',
>  	'rte_rtm.h',
>  	'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..6dd1cdc939
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> --
> 2.17.1
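
As a side note on the expected_value/value_mask semantics documented above,
this is how I read the pre-sleep check, written out as a standalone helper.
demo_should_skip_sleep() is a hypothetical name; it only mirrors the
comparison in the patch and does not execute any of the instructions.

```c
#include <stdint.h>

/*
 * Returns non-zero when the sleep should be skipped because the masked
 * value at `p` already equals `expected_value`. A zero mask disables
 * the check entirely.
 */
static int
demo_should_skip_sleep(const volatile uint64_t *p, uint64_t expected_value,
		uint64_t value_mask)
{
	if (value_mask == 0)
		return 0;
	return (*p & value_mask) == expected_value;
}
```

Note that the comparison is against the already-masked value, so the caller
has to pass an expected_value with no bits set outside value_mask.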



* Re: [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (9 preceding siblings ...)
  2020-10-02 14:44       ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Bruce Richardson
@ 2020-10-08 22:08       ` Ananyev, Konstantin
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                         ` (9 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 22:08 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly


> Add new x86 cpuid support for WAITPKG.
> This flag indicate processor support umwait/umonitor/tpause
> instruction.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
>  lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..5041a830a7 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
>  	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
> 
> +	/**< UMWAIT/TPAUSE Instructions */
> +	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
>  	/* The last item */
>  	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
>  };
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 30439e7951..0325c4b93b 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
>  	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>  	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
> 
> +	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
> +
>  	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>  	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
> 
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1



* Re: [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
  2020-10-08  8:46         ` Thomas Monjalon
@ 2020-10-08 22:26         ` Ananyev, Konstantin
  2020-10-09 16:11           ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 22:26 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> 
> Add a simple API allow ethdev get wake up address from PMD.
> Also include internal structure update.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_ethdev/rte_ethdev.c           | 19 ++++++++++++++++
>  lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_version.map |  1 +
>  4 files changed, 72 insertions(+)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index d7668114ca..88253d95f9 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -4804,6 +4804,25 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
>  }
> 
> +int
> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		      volatile void **wake_addr,
> +		      uint64_t *expected, uint64_t *mask)
> +{
> +	struct rte_eth_dev *dev;
> +	uint16_t ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +
> +	dev = &rte_eth_devices[port_id];
> +
> +	ret = (*dev->dev_ops->get_wake_addr)
> +				(dev->data->rx_queues[queue_id],
> +				 wake_addr, expected, mask);


This is an optional dev_ops, so I think you need to check that get_wake_addr()
is defined for that PMD.
Plus you need to check that queue_id is valid.
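
Something along these lines, as a sketch only. The structures below are
simplified stand-ins (struct demo_eth_dev is not the real rte_eth_dev), and
the choice of -ENOTSUP for a missing callback is my assumption, in line
with how other optional dev_ops are usually handled. Note also that the
result is kept in an int rather than uint16_t, so negative errno values
survive.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*demo_get_wake_addr_t)(void *rxq, volatile void **wake_addr,
		uint64_t *expected, uint64_t *mask);

/* Simplified stand-in for the device structure. */
struct demo_eth_dev {
	demo_get_wake_addr_t get_wake_addr; /* optional, may be NULL */
	void **rx_queues;
	uint16_t nb_rx_queues;
};

static int
demo_get_wake_addr(struct demo_eth_dev *dev, uint16_t queue_id,
		volatile void **wake_addr, uint64_t *expected, uint64_t *mask)
{
	/* optional callback: the PMD may not implement it */
	if (dev->get_wake_addr == NULL)
		return -ENOTSUP;
	/* reject queue indexes beyond the configured RX queues */
	if (queue_id >= dev->nb_rx_queues)
		return -EINVAL;

	return dev->get_wake_addr(dev->rx_queues[queue_id],
			wake_addr, expected, mask);
}

/* Toy driver callback, only used to exercise the wrapper. */
static int
demo_pmd_get_wake_addr(void *rxq, volatile void **wake_addr,
		uint64_t *expected, uint64_t *mask)
{
	*wake_addr = rxq;
	*expected = 1;
	*mask = 1;
	return 0;
}
```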

> +
> +	return ret;
> +}
> +
>  int
>  rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>  			     struct rte_ether_addr *mc_addr_set,
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index d2bf74f128..a6cfe3cd57 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -4014,6 +4014,30 @@ __rte_experimental
>  int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  	struct rte_eth_burst_mode *mode);
> 
> +/**
> + * Retrieve the wake up address from specific queue
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Tx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param wake_addr
> + *   The pointer point to the address which is used for monitoring.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + *
> + * @return
> + *   - 0: Success.
> + *   -EINVAL: Failed to get wake address.
> + */
> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +			  volatile void **wake_addr,
> +			  uint64_t *expected, uint64_t *mask);
> +
>  /**
>   * Retrieve device registers and register attributes (number of registers and
>   * register size)
> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
> index c3062c246c..935d46f25c 100644
> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> @@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the Wake up address.
> + *
> + * @param rxq
> + *   Ethdev queue pointer.
> + * @param tail_desc_addr
> + *   The pointer point to descriptor address var.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */
> +typedef int (*eth_get_wake_addr_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -713,6 +738,9 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_get_wake_addr_t get_wake_addr;
> +	/**< Get wake up address. */
> +
>  };
> 
>  /**
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index c95ef5157a..3cb2093980 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -229,6 +229,7 @@ EXPERIMENTAL {
>  	# added in 20.11
>  	rte_eth_link_speed_to_str;
>  	rte_eth_link_to_str;
> +	rte_eth_get_wake_addr;
>  };
> 
>  INTERNAL {
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08 17:07                 ` Ananyev, Konstantin
@ 2020-10-09  5:42                   ` Jerin Jacob
  2020-10-09  9:25                     ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09  5:42 UTC (permalink / raw)
  To: Ananyev, Konstantin, David Marchand
  Cc: Burakov, Anatoly, Thomas Monjalon, Ma, Liang J, dpdk-dev, Hunt,
	David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
> >
> > On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> > >
> > > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > >>
> > > >>> Add two new power management intrinsics, and provide an implementation
> > > >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > >>> are implemented as raw byte opcodes because there is not yet widespread
> > > >>> compiler support for these instructions.
> > > >>>
> > > >>> The power management instructions provide an architecture-specific
> > > >>> function to either wait until a specified TSC timestamp is reached, or
> > > >>> optionally wait until either a TSC timestamp is reached or a memory
> > > >>> location is written to. The monitor function also provides an optional
> > > >>> comparison, to avoid sleeping when the expected write has already
> > > >>> happened, and no more writes are expected.
> > > >>>
> > > >>> For more details, Please reference Intel SDM Volume 2.
> > > >>
> > > >> I really would like to see feedbacks from other arch maintainers.
> > > >> Unfortunately they were not Cc'ed.
> > > >
> > > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > > >
> > > >> Also please mark the new functions as experimental.
> > > >>
> > > >>
> > >
> > > Hi Jerin,
> >
> > Hi Anatoly,
> >
> > >
> > >  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> > >  > the consumer of this API is only supported on x86. Probably as
> > > functions[1]
> > >  > or macro flags scheme and have a stub for the other architectures as the
> > >  > API marked as generic ie rte_power_* not rte_x86_..
> > >  >
> > >  > This will help the consumer to create workers based on the
> > > instruction features
> > >  > which can NOT be abstracted as a generic feature across the
> > > architectures.
> > >
> > > I'm not entirely sure what you mean by that.
> > >
> > > I mean, yes, we should have added stubs for other architectures, and we
> > > will add those in future revisions, but what does your proposed runtime
> > > check accomplish that cannot currently be done with CPUID flags?
> >
> >
> > RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> > i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> > and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> > I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>
>
> I am agree with Jerin, that we need some generic way to
> figure-out does platform supports power_monitor() or not.
> Though not sure do we need to create a new feature-get framework here...

That works too. Some means of generic probing is fine. The scheme below
needs more documentation on its usage, as it is not as straightforward
as a feature-get framework. Also, on the other thread, we are adding new
instructions such as cache line demote; if the user wants to KNOW whether
the arch supports them, then the feature-get framework is good.
If we think there is no other use case for a generic arch feature-get
framework, then we can keep the scheme below; otherwise a generic arch
feature-get framework is better for more forward-looking use cases.
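
To illustrate what a generic arch feature-get framework might look like,
here is a minimal sketch. All names (arch_feature, arch_feature_set,
arch_feature_get) are hypothetical and not part of any existing DPDK API;
the assumption is that per-arch init code would fill the table after its
own checks (e.g. CPUID on x86).

```c
#include <stdbool.h>

/* Hypothetical feature identifiers, for illustration only. */
enum arch_feature {
	ARCH_FEATURE_POWER_MONITOR,    /* e.g. UMONITOR/UMWAIT on x86 */
	ARCH_FEATURE_CACHELINE_DEMOTE, /* e.g. CLDEMOTE on x86 */
	ARCH_FEATURE_MAX
};

/* Zero-initialized: every feature is reported as unsupported until
 * arch-specific init code says otherwise. */
static bool arch_features[ARCH_FEATURE_MAX];

/* Would be called by arch-specific init code after probing the CPU. */
static void
arch_feature_set(enum arch_feature feat, bool supported)
{
	if ((unsigned int)feat < (unsigned int)ARCH_FEATURE_MAX)
		arch_features[feat] = supported;
}

/* Generic query usable by applications on any architecture. */
static bool
arch_feature_get(enum arch_feature feat)
{
	if ((unsigned int)feat >= (unsigned int)ARCH_FEATURE_MAX)
		return false;
	return arch_features[feat];
}
```

With something like this, an application could decide at startup which
kind of worker to create without referencing x86-only flag names.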

> Might be just something like:
>  rte_power_monitor(...) == -ENOTSUP
> be enough indication for that?
> So user can just do:
> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>         /* not supported  path */
> }
>
> To check is that feature supported or not.


>
> > >
> > > If you look at patch 1 [1], we added CPUID flags that the user can
> > > check, and in fact this is precisely what we do in patch 4 [2] before
> > > enabling the UMWAIT path. We could perhaps document this better and
> > > outline the dependency on the WAITPKG CPUID flag more explicitly, but
> > > otherwise i don't see how what you're proposing isn't already possible
> > > to do.
> > >
> > > [1] http://patches.dpdk.org/patch/79539/
> > > [2] http://patches.dpdk.org/patch/79540/ , function
> > > rte_power_pmd_mgmt_queue_enable()
> > >
> > > --
> > > Thanks,
> > > Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08 17:15         ` Ananyev, Konstantin
@ 2020-10-09  9:11           ` Burakov, Anatoly
  2020-10-09 15:39             ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09  9:11 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
> 
> I think what this API is missing - a function to wakeup sleeping core.
> If user can/should use some system call to achieve that, then at least
> it has to be clearly documented, even better some wrapper provided.

I don't think it's possible to do that without severely overcomplicating 
the intrinsic and its usage, because AFAIK the only way to wake up a 
sleeping core would be to send some kind of interrupt to the core, or 
trigger a write to the cache-line in question.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  5:42                   ` Jerin Jacob
@ 2020-10-09  9:25                     ` Burakov, Anatoly
  2020-10-09  9:29                       ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09  9:25 UTC (permalink / raw)
  To: Jerin Jacob, Ananyev, Konstantin, David Marchand
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
>>
>>>
>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>> <anatoly.burakov@intel.com> wrote:
>>>>
>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>
>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>> compiler support for these instructions.
>>>>>>>
>>>>>>> The power management instructions provide an architecture-specific
>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>> happened, and no more writes are expected.
>>>>>>>
>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>
>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>> Unfortunately they were not Cc'ed.
>>>>>
>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>
>>>>>> Also please mark the new functions as experimental.
>>>>>>
>>>>>>
>>>>
>>>> Hi Jerin,
>>>
>>> Hi Anatoly,
>>>
>>>>
>>>>   > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>   > the consumer of this API is only supported on x86. Probably as
>>>> functions[1]
>>>>   > or macro flags scheme and have a stub for the other architectures as the
>>>>   > API marked as generic ie rte_power_* not rte_x86_..
>>>>   >
>>>>   > This will help the consumer to create workers based on the
>>>> instruction features
>>>>   > which can NOT be abstracted as a generic feature across the
>>>> architectures.
>>>>
>>>> I'm not entirely sure what you mean by that.
>>>>
>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>> will add those in future revisions, but what does your proposed runtime
>>>> check accomplish that cannot currently be done with CPUID flags?
>>>
>>>
>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>
>>
>> I am agree with Jerin, that we need some generic way to
>> figure-out does platform supports power_monitor() or not.
>> Though not sure do we need to create a new feature-get framework here...
> 
> That's works too. Some means of generic probing is fine. Following
> schemed needs
> more documentation on that usage, as, it is not straight forward compare to
> feature-get framework. Also, on the other thread, we are adding the
> new instructions like
> demote cacheline etc, maybe if the user wants to KNOW if the arch
> supports it then
> the feature-get framework is good.
> If we think, there is no other usecase for generic arch feature-get
> framework then
> we can keep the below scheme else generic arch feature is better for
> more forward
> looking use cases.
> 
>> Might be just something like:
>>   rte_power_monitor(...) == -ENOTSUP
>> be enough indication for that?
>> So user can just do:
>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>          /* not supported  path */
>> }
>>
>> To check is that feature supported or not.
> 
> 

Looking at the CLDEMOTE patches, CLDEMOTE is a noop on other archs. I
think we can safely make this intrinsic a noop on other archs as well,
as that is functionally identical to waking up immediately.

If we're not creating this for CLDEMOTE, we don't need it here as well. 
If we do need it for this, then we arguably need it for CLDEMOTE too.

>>
>>>>
>>>> If you look at patch 1 [1], we added CPUID flags that the user can
>>>> check, and in fact this is precisely what we do in patch 4 [2] before
>>>> enabling the UMWAIT path. We could perhaps document this better and
>>>> outline the dependency on the WAITPKG CPUID flag more explicitly, but
>>>> otherwise i don't see how what you're proposing isn't already possible
>>>> to do.
>>>>
>>>> [1] http://patches.dpdk.org/patch/79539/
>>>> [2] http://patches.dpdk.org/patch/79540/ , function
>>>> rte_power_pmd_mgmt_queue_enable()
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:25                     ` Burakov, Anatoly
@ 2020-10-09  9:29                       ` Thomas Monjalon
  2020-10-09  9:40                         ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-09  9:29 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Jerin Jacob, Ananyev, Konstantin, David Marchand, Ma, Liang J,
	dpdk-dev, Hunt, David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

09/10/2020 11:25, Burakov, Anatoly:
> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> > On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> >>
> >>>
> >>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>>
> >>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>
> >>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>> compiler support for these instructions.
> >>>>>>>
> >>>>>>> The power management instructions provide an architecture-specific
> >>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>> happened, and no more writes are expected.
> >>>>>>>
> >>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>
> >>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>> Unfortunately they were not Cc'ed.
> >>>>>
> >>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>
> >>>>>> Also please mark the new functions as experimental.
> >>>>>>
> >>>>>>
> >>>>
> >>>> Hi Jerin,
> >>>
> >>> Hi Anatoly,
> >>>
> >>>>
> >>>>   > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>   > the consumer of this API is only supported on x86. Probably as
> >>>> functions[1]
> >>>>   > or macro flags scheme and have a stub for the other architectures as the
> >>>>   > API marked as generic ie rte_power_* not rte_x86_..
> >>>>   >
> >>>>   > This will help the consumer to create workers based on the
> >>>> instruction features
> >>>>   > which can NOT be abstracted as a generic feature across the
> >>>> architectures.
> >>>>
> >>>> I'm not entirely sure what you mean by that.
> >>>>
> >>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>> will add those in future revisions, but what does your proposed runtime
> >>>> check accomplish that cannot currently be done with CPUID flags?
> >>>
> >>>
> >>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>
> >>
> >> I am agree with Jerin, that we need some generic way to
> >> figure-out does platform supports power_monitor() or not.
> >> Though not sure do we need to create a new feature-get framework here...
> > 
> > That's works too. Some means of generic probing is fine. Following
> > schemed needs
> > more documentation on that usage, as, it is not straight forward compare to
> > feature-get framework. Also, on the other thread, we are adding the
> > new instructions like
> > demote cacheline etc, maybe if the user wants to KNOW if the arch
> > supports it then
> > the feature-get framework is good.
> > If we think, there is no other usecase for generic arch feature-get
> > framework then
> > we can keep the below scheme else generic arch feature is better for
> > more forward
> > looking use cases.
> > 
> >> Might be just something like:
> >>   rte_power_monitor(...) == -ENOTSUP
> >> be enough indication for that?
> >> So user can just do:
> >> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>          /* not supported  path */
> >> }
> >>
> >> To check is that feature supported or not.
> > 
> > 
> 
> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think 
> we can safely make this intrinsic as a noop on other archs as well, as 
> it's functionally identical to waking up immediately.
> 
> If we're not creating this for CLDEMOTE, we don't need it here as well. 
> If we do need it for this, then we arguably need it for CLDEMOTE too.

Sorry I don't understand what you mean, too many "it" and "this" :)



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:29                       ` Thomas Monjalon
@ 2020-10-09  9:40                         ` Burakov, Anatoly
  2020-10-09  9:54                           ` Jerin Jacob
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09  9:40 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Jerin Jacob, Ananyev, Konstantin, David Marchand, Ma, Liang J,
	dpdk-dev, Hunt, David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> 09/10/2020 11:25, Burakov, Anatoly:
>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>> <konstantin.ananyev@intel.com> wrote:
>>>>
>>>>>
>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>
>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>
>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>> compiler support for these instructions.
>>>>>>>>>
>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>
>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>
>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>
>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>
>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> Hi Jerin,
>>>>>
>>>>> Hi Anatoly,
>>>>>
>>>>>>
>>>>>>    > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>    > the consumer of this API is only supported on x86. Probably as
>>>>>> functions[1]
>>>>>>    > or macro flags scheme and have a stub for the other architectures as the
>>>>>>    > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>    >
>>>>>>    > This will help the consumer to create workers based on the
>>>>>> instruction features
>>>>>>    > which can NOT be abstracted as a generic feature across the
>>>>>> architectures.
>>>>>>
>>>>>> I'm not entirely sure what you mean by that.
>>>>>>
>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>
>>>>>
>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>
>>>>
>>>> I am agree with Jerin, that we need some generic way to
>>>> figure-out does platform supports power_monitor() or not.
>>>> Though not sure do we need to create a new feature-get framework here...
>>>
>>> That's works too. Some means of generic probing is fine. Following
>>> schemed needs
>>> more documentation on that usage, as, it is not straight forward compare to
>>> feature-get framework. Also, on the other thread, we are adding the
>>> new instructions like
>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>> supports it then
>>> the feature-get framework is good.
>>> If we think, there is no other usecase for generic arch feature-get
>>> framework then
>>> we can keep the below scheme else generic arch feature is better for
>>> more forward
>>> looking use cases.
>>>
>>>> Might be just something like:
>>>>    rte_power_monitor(...) == -ENOTSUP
>>>> be enough indication for that?
>>>> So user can just do:
>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>           /* not supported  path */
>>>> }
>>>>
>>>> To check is that feature supported or not.
>>>
>>>
>>
>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>> we can safely make this intrinsic as a noop on other archs as well, as
>> it's functionally identical to waking up immediately.
>>
>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> 
> Sorry I don't understand what you mean, too many "it" and "this" :)
> 

Sorry, I meant "the generic feature-get framework". CLDEMOTE doesn't
exist on other archs, and neither does this, so it's a fairly similar
situation. Stubbing UMWAIT with a noop is a valid approach because it's
equivalent to sleeping and then immediately waking up (which can happen
for a host of reasons unrelated to the code itself).
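
As a rough sketch of why a noop stub is safe for a caller that re-checks
its condition after every wake-up (the function names and the two-argument
signature here are made up for illustration, they are not the ones from
the patch):

```c
#include <stdint.h>

/* Hypothetical stand-in for the monitor intrinsic on an arch without
 * UMONITOR/UMWAIT: it returns at once, like an immediate wake-up. */
static inline void
power_monitor_stub(const volatile uint64_t *addr, uint64_t tsc_deadline)
{
	(void)addr;
	(void)tsc_deadline;
}

/*
 * The caller re-checks the monitored value after every wake-up, so an
 * immediate (or spurious) wake-up only costs power, not correctness.
 * Returns the number of wake-ups taken, capped at max_wakeups.
 */
static unsigned int
wait_until_changed(const volatile uint64_t *addr, uint64_t old_val,
		unsigned int max_wakeups)
{
	unsigned int n = 0;

	while (*addr == old_val && n < max_wakeups) {
		power_monitor_stub(addr, 0);
		n++;
	}
	return n;
}
```

With the stub, the loop above simply degrades into ordinary busy polling.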

I'm not against a generic feature-get framework; I'm just pointing out
that if this is what's preventing the merge, it should prevent the merge
of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
stated that he's OK with leaving CLDEMOTE as a noop on other architectures.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:40                         ` Burakov, Anatoly
@ 2020-10-09  9:54                           ` Jerin Jacob
  2020-10-09 10:03                             ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09  9:54 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, Ananyev, Konstantin, David Marchand, Ma,
	Liang J, dpdk-dev, Hunt, David, Stephen Hemminger,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> > 09/10/2020 11:25, Burakov, Anatoly:
> >> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>> <konstantin.ananyev@intel.com> wrote:
> >>>>
> >>>>>
> >>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>
> >>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>
> >>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>> compiler support for these instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>
> >>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>
> >>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>
> >>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>
> >>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>> Hi Jerin,
> >>>>>
> >>>>> Hi Anatoly,
> >>>>>
> >>>>>>
> >>>>>>    > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>    > the consumer of this API is only supported on x86. Probably as
> >>>>>> functions[1]
> >>>>>>    > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>    > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>    >
> >>>>>>    > This will help the consumer to create workers based on the
> >>>>>> instruction features
> >>>>>>    > which can NOT be abstracted as a generic feature across the
> >>>>>> architectures.
> >>>>>>
> >>>>>> I'm not entirely sure what you mean by that.
> >>>>>>
> >>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>
> >>>>>
> >>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>
> >>>>
> >>>> I am agree with Jerin, that we need some generic way to
> >>>> figure-out does platform supports power_monitor() or not.
> >>>> Though not sure do we need to create a new feature-get framework here...
> >>>
> >>> That's works too. Some means of generic probing is fine. Following
> >>> schemed needs
> >>> more documentation on that usage, as, it is not straight forward compare to
> >>> feature-get framework. Also, on the other thread, we are adding the
> >>> new instructions like
> >>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>> supports it then
> >>> the feature-get framework is good.
> >>> If we think, there is no other usecase for generic arch feature-get
> >>> framework then
> >>> we can keep the below scheme else generic arch feature is better for
> >>> more forward
> >>> looking use cases.
> >>>
> >>>> Might be just something like:
> >>>>    rte_power_monitor(...) == -ENOTSUP
> >>>> be enough indication for that?
> >>>> So user can just do:
> >>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>           /* not supported  path */
> >>>> }
> >>>>
> >>>> To check is that feature supported or not.
> >>>
> >>>
> >>
> >> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >> we can safely make this intrinsic as a noop on other archs as well, as
> >> it's functionally identical to waking up immediately.
> >>
> >> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >
> > Sorry I don't understand what you mean, too many "it" and "this" :)
> >
>
> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> exist on other archs, this doesn't too, so it's a fairly similar
> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> equivalent to sleeping and then immediately waking up (which can happen
> for a host of reasons unrelated to the code itself).

If we are keeping the following return values in the public API, then it
cannot be a NOP:
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
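
For reference, a fallback that tries to honor that return contract could
look roughly like the sketch below. It is a pure software emulation under
my own assumptions: the name emulated_monitor_wait and the poll budget
(standing in for the TSC deadline) are hypothetical, not from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical software emulation of the documented contract:
 *   returns 0 if the monitored location no longer holds 'expected',
 *   returns 1 if the poll budget (stand-in for the TSC deadline) ran out.
 */
static int
emulated_monitor_wait(const volatile uint64_t *addr, uint64_t expected,
		uint64_t max_polls)
{
	for (uint64_t i = 0; i < max_polls; i++) {
		if (*addr != expected)
			return 0; /* woke up because the value changed */
	}
	return 1; /* budget exhausted, treated as a timeout */
}
```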

Also, we need to fix the compilation issue, if any, with
http://patches.dpdk.org/patch/79540/
as it has a direct reference to
if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
Either we need to add an -ENOTSUP return or a generic feature-get framework.


>
> I'm not against a generic feature-get framework, i'm just pointing out
> that if this is what's preventing the merge, it should prevent the merge
> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:54                           ` Jerin Jacob
@ 2020-10-09 10:03                             ` Burakov, Anatoly
  2020-10-09 10:17                               ` Thomas Monjalon
  2020-10-09 10:19                               ` Jerin Jacob
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 10:03 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Thomas Monjalon, Ananyev, Konstantin, David Marchand, Ma,
	Liang J, dpdk-dev, Hunt, David, Stephen Hemminger,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
>>
>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>
>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>
>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>
>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>
>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>
>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>
>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Jerin,
>>>>>>>
>>>>>>> Hi Anatoly,
>>>>>>>
>>>>>>>>
>>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>     > the consumer of this API is only supported on x86. Probably as
>>>>>>>> functions[1]
>>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>     >
>>>>>>>>     > This will help the consumer to create workers based on the
>>>>>>>> instruction features
>>>>>>>>     > which can NOT be abstracted as a generic feature across the
>>>>>>>> architectures.
>>>>>>>>
>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>
>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>
>>>>>>>
>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>
>>>>>>
>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>
>>>>> That's works too. Some means of generic probing is fine. Following
>>>>> schemed needs
>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>> new instructions like
>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>> supports it then
>>>>> the feature-get framework is good.
>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>> framework then
>>>>> we can keep the below scheme else generic arch feature is better for
>>>>> more forward
>>>>> looking use cases.
>>>>>
>>>>>> Might be just something like:
>>>>>>     rte_power_monitor(...) == -ENOTSUP
>>>>>> be enough indication for that?
>>>>>> So user can just do:
>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>            /* not supported  path */
>>>>>> }
>>>>>>
>>>>>> To check is that feature supported or not.
>>>>>
>>>>>
>>>>
>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>> it's functionally identical to waking up immediately.
>>>>
>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>
>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>
>>
>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>> exist on other archs, this doesn't too, so it's a fairly similar
>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>> equivalent to sleeping and then immediately waking up (which can happen
>> for a host of reasons unrelated to the code itself).
> 
> If we are keeping the following return in the public API then it can not be NOP
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> 

In the generic header, the return value is specified as
implementation-defined (i.e. arch-specific). I guess we could remove
that and set the return value to either 0 or -ENOTSUP, if that would
resolve the issue?
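
A minimal sketch of the -ENOTSUP option, only to make the idea concrete.
Every name here (sketch_power_monitor, SKETCH_HAVE_MONITOR, the parameter
list) is hypothetical and is not the signature from the patch. It assumes
the generic header supplies a stub on architectures without an
implementation, so a caller can probe for support the way Konstantin
suggested:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical switch; a real build would derive it from the target arch. */
#ifndef SKETCH_HAVE_MONITOR
#define SKETCH_HAVE_MONITOR 0
#endif

/*
 * Hypothetical stand-in for the generic intrinsic.
 * Returns 0 after a wakeup (for whatever reason), or -ENOTSUP when the
 * architecture has no implementation.
 */
static inline int
sketch_power_monitor(const volatile void *addr, uint64_t expected,
		uint64_t mask, uint64_t tsc_deadline)
{
	(void)addr;
	(void)expected;
	(void)mask;
	(void)tsc_deadline;
#if SKETCH_HAVE_MONITOR
	/* The arch-specific monitor/wait sequence would go here. */
	return 0;
#else
	return -ENOTSUP;
#endif
}

/* Probe once at startup instead of relying on an arch-specific CPU flag. */
static inline int
sketch_power_monitor_supported(void)
{
	return sketch_power_monitor(NULL, 0, 0, 0) != -ENOTSUP;
}
```

With a contract like this the caller never needs to interpret an
arch-specific wakeup reason; it only distinguishes "returned" from
"not supported".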

> Also, we need to fix compilation issue if any with
> http://patches.dpdk.org/patch/79540/
> as it has direct reference to if
> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> Either we need to add -ENOTSUP return or generic feature-get framework.

IIRC the power library isn't compiled on anything other than x86, so
this code wouldn't get compiled.

> 
> 
>>
>> I'm not against a generic feature-get framework, i'm just pointing out
>> that if this is what's preventing the merge, it should prevent the merge
>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>>
>> --
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:03                             ` Burakov, Anatoly
@ 2020-10-09 10:17                               ` Thomas Monjalon
  2020-10-09 10:22                                 ` Burakov, Anatoly
  2020-10-09 10:19                               ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-09 10:17 UTC (permalink / raw)
  To: Jerin Jacob, Burakov, Anatoly
  Cc: Ananyev, Konstantin, David Marchand, Ma, Liang J, dpdk-dev, Hunt,
	David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

09/10/2020 12:03, Burakov, Anatoly:
> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> >> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>
> >>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>
> >>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>
> >>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>
> >>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>
> >>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jerin,
> >>>>>>>
> >>>>>>> Hi Anatoly,
> >>>>>>>
> >>>>>>>>
> >>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>     > the consumer of this API is only supported on x86. Probably as
> >>>>>>>> functions[1]
> >>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>     >
> >>>>>>>>     > This will help the consumer to create workers based on the
> >>>>>>>> instruction features
> >>>>>>>>     > which can NOT be abstracted as a generic feature across the
> >>>>>>>> architectures.
> >>>>>>>>
> >>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>
> >>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>
> >>>>>>>
> >>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>
> >>>>>>
> >>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>
> >>>>> That's works too. Some means of generic probing is fine. Following
> >>>>> schemed needs
> >>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>> new instructions like
> >>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>> supports it then
> >>>>> the feature-get framework is good.
> >>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>> framework then
> >>>>> we can keep the below scheme else generic arch feature is better for
> >>>>> more forward
> >>>>> looking use cases.
> >>>>>
> >>>>>> Might be just something like:
> >>>>>>     rte_power_monitor(...) == -ENOTSUP
> >>>>>> be enough indication for that?
> >>>>>> So user can just do:
> >>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>            /* not supported  path */
> >>>>>> }
> >>>>>>
> >>>>>> To check is that feature supported or not.
> >>>>>
> >>>>>
> >>>>
> >>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>> it's functionally identical to waking up immediately.
> >>>>
> >>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>
> >>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>
> >>
> >> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >> exist on other archs, this doesn't too, so it's a fairly similar
> >> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >> equivalent to sleeping and then immediately waking up (which can happen
> >> for a host of reasons unrelated to the code itself).
> > 
> > If we are keeping the following return in the public API then it can not be NOP
> > + * @return
> > + *   - 1 if wakeup was due to TSC timeout expiration.
> > + *   - 0 if wakeup was due to memory write or other reasons.
> > + */
> > 
> 
> In the generic header, it is specified that return value is 
> implementation-defined (i.e. arch-specific).

Obviously an API definition should *never* be "implementation-defined".


> I guess we could remove 
> that and set return value to either 0 or -ENOTSUP if that would resolve 
> the issue?
> 
> > Also, we need to fix compilation issue if any with
> > http://patches.dpdk.org/patch/79540/
> > as it has direct reference to if
> > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > Either we need to add -ENOTSUP return or generic feature-get framework.
> 
> IIRC power library isn't compiled on anything other than x86, so this 
> code wouldn't get compiled.

It is not called "power-x86", so we must assume it could work
on any architecture.


> >> I'm not against a generic feature-get framework, i'm just pointing out
> >> that if this is what's preventing the merge, it should prevent the merge
> >> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.

CLDEMOTE is used purely for optimization, while UMWAIT can be part of
the application logic; that is why the expectations may be different.




* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:03                             ` Burakov, Anatoly
  2020-10-09 10:17                               ` Thomas Monjalon
@ 2020-10-09 10:19                               ` Jerin Jacob
  1 sibling, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09 10:19 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, Ananyev, Konstantin, David Marchand, Ma,
	Liang J, dpdk-dev, Hunt, David, Stephen Hemminger,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Fri, Oct 9, 2020 at 3:33 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> >>
> >> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>
> >>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>
> >>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>
> >>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>
> >>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>
> >>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>
> >>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jerin,
> >>>>>>>
> >>>>>>> Hi Anatoly,
> >>>>>>>
> >>>>>>>>
> >>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>     > the consumer of this API is only supported on x86. Probably as
> >>>>>>>> functions[1]
> >>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>     >
> >>>>>>>>     > This will help the consumer to create workers based on the
> >>>>>>>> instruction features
> >>>>>>>>     > which can NOT be abstracted as a generic feature across the
> >>>>>>>> architectures.
> >>>>>>>>
> >>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>
> >>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>
> >>>>>>>
> >>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>
> >>>>>>
> >>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>
> >>>>> That's works too. Some means of generic probing is fine. Following
> >>>>> schemed needs
> >>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>> new instructions like
> >>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>> supports it then
> >>>>> the feature-get framework is good.
> >>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>> framework then
> >>>>> we can keep the below scheme else generic arch feature is better for
> >>>>> more forward
> >>>>> looking use cases.
> >>>>>
> >>>>>> Might be just something like:
> >>>>>>     rte_power_monitor(...) == -ENOTSUP
> >>>>>> be enough indication for that?
> >>>>>> So user can just do:
> >>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>            /* not supported  path */
> >>>>>> }
> >>>>>>
> >>>>>> To check is that feature supported or not.
> >>>>>
> >>>>>
> >>>>
> >>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>> it's functionally identical to waking up immediately.
> >>>>
> >>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>
> >>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>
> >>
> >> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >> exist on other archs, this doesn't too, so it's a fairly similar
> >> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >> equivalent to sleeping and then immediately waking up (which can happen
> >> for a host of reasons unrelated to the code itself).
> >
> > If we are keeping the following return in the public API then it can not be NOP
> > + * @return
> > + *   - 1 if wakeup was due to TSC timeout expiration.
> > + *   - 0 if wakeup was due to memory write or other reasons.
> > + */
> >
>
> In the generic header, it is specified that return value is
> implementation-defined (i.e. arch-specific). I guess we could remove
> that and set return value to either 0 or -ENOTSUP if that would resolve
> the issue?

Returning -ENOTSUP should be OK if we don't want to take the generic
feature-get framework path.

>
> > Also, we need to fix compilation issue if any with
> > http://patches.dpdk.org/patch/79540/
> > as it has direct reference to if
> > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > Either we need to add -ENOTSUP return or generic feature-get framework.
>
> IIRC power library isn't compiled on anything other than x86, so this
> code wouldn't get compiled.

Just checked now, librte_power compiles at least for arm64.


>
> >
> >
> >>
> >> I'm not against a generic feature-get framework, i'm just pointing out
> >> that if this is what's preventing the merge, it should prevent the merge
> >> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >>
> >> --
> >> Thanks,
> >> Anatoly
>
>
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:17                               ` Thomas Monjalon
@ 2020-10-09 10:22                                 ` Burakov, Anatoly
  2020-10-09 10:45                                   ` Jerin Jacob
  2020-10-09 10:48                                   ` Ananyev, Konstantin
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 10:22 UTC (permalink / raw)
  To: Thomas Monjalon, Jerin Jacob
  Cc: Ananyev, Konstantin, David Marchand, Ma, Liang J, dpdk-dev, Hunt,
	David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> 09/10/2020 12:03, Burakov, Anatoly:
>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>> <anatoly.burakov@intel.com> wrote:
>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>
>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>
>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>
>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Jerin,
>>>>>>>>>
>>>>>>>>> Hi Anatoly,
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>> functions[1]
>>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>      >
>>>>>>>>>>      > This will help the consumer to create workers based on the
>>>>>>>>>> instruction features
>>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
>>>>>>>>>> architectures.
>>>>>>>>>>
>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>
>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>
>>>>>>>>
>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>
>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>> schemed needs
>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>> new instructions like
>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>> supports it then
>>>>>>> the feature-get framework is good.
>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>> framework then
>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>> more forward
>>>>>>> looking use cases.
>>>>>>>
>>>>>>>> Might be just something like:
>>>>>>>>      rte_power_monitor(...) == -ENOTSUP
>>>>>>>> be enough indication for that?
>>>>>>>> So user can just do:
>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>             /* not supported  path */
>>>>>>>> }
>>>>>>>>
>>>>>>>> To check is that feature supported or not.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>> it's functionally identical to waking up immediately.
>>>>>>
>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>
>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>
>>>>
>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>> for a host of reasons unrelated to the code itself).
>>>
>>> If we are keeping the following return in the public API then it can not be NOP
>>> + * @return
>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>> + */
>>>
>>
>> In the generic header, it is specified that return value is
>> implementation-defined (i.e. arch-specific).
> 
> Obviously an API definition should *never* be "implementation-defined".

If there isn't a meaningful return value, we could either make it
void, or return 0/-ENOTSUP. I'm OK with either, as a nop is a valid
result for a UMWAIT, and there are no side-effects to the intrinsic
itself (it's basically a fancy rte_pause).

> 
> 
>> I guess we could remove
>> that and set return value to either 0 or -ENOTSUP if that would resolve
>> the issue?
>>
>>> Also, we need to fix compilation issue if any with
>>> http://patches.dpdk.org/patch/79540/
>>> as it has direct reference to if
>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>
>> IIRC power library isn't compiled on anything other than x86, so this
>> code wouldn't get compiled.
> 
> It is not call "power-x86", so we must assume it could work
> on any architecture.

#ifdef it is!
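
A rough sketch of what such a guard could look like, assuming GCC or
Clang. To stay self-contained it reads CPUID directly through
__get_cpuid_count() from <cpuid.h> rather than calling
rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG) as the patch does, and
sketch_waitpkg_check() is a hypothetical name:

```c
#include <errno.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define SKETCH_ARCH_X86 1
#else
#define SKETCH_ARCH_X86 0
#endif

/* Returns 0 if UMONITOR/UMWAIT can be used, -ENOTSUP otherwise. */
static inline int
sketch_waitpkg_check(void)
{
#if SKETCH_ARCH_X86
	unsigned int eax, ebx, ecx, edx;

	/* WAITPKG is reported in CPUID leaf 7, subleaf 0, ECX bit 5. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return -ENOTSUP;
	return (ecx & (1u << 5)) ? 0 : -ENOTSUP;
#else
	/* No x86-only symbol is referenced here, so other archs still build. */
	return -ENOTSUP;
#endif
}
```

The x86-only identifier is confined to the guarded branch, which is the
point of the #ifdef: the library keeps compiling on arm64 and simply
reports the feature as unsupported.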

> 
> 
>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>> that if this is what's preventing the merge, it should prevent the merge
>>>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
>>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> 
> CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> that's why the expectations may be different.
> 

UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
legal for a UMWAIT to not sleep at all, thus rendering it effectively a
noop. So I don't think it's all that different.
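
To illustrate the best-effort argument, here is a sketch of a caller that
stays correct even when the wait primitive does nothing. All names are
hypothetical; sketch_wait_noop stands in for a stubbed intrinsic on an
architecture without UMWAIT:

```c
#include <stdint.h>

/* Hypothetical wait primitive: may sleep, or may return immediately. */
typedef void (*sketch_wait_fn)(const volatile uint64_t *addr);

/* Stub for architectures without a monitor/wait instruction. */
static void
sketch_wait_noop(const volatile uint64_t *addr)
{
	(void)addr;
}

/*
 * Re-check the condition after every wakeup, because a wakeup carries no
 * guarantee that the expected write happened.
 * Returns 1 if (*addr & mask) == expected was observed, 0 if the
 * iteration budget ran out first.
 */
static int
sketch_wait_for_value(const volatile uint64_t *addr, uint64_t expected,
		uint64_t mask, sketch_wait_fn wait, unsigned int max_iters)
{
	unsigned int i;

	for (i = 0; i < max_iters; i++) {
		if ((*addr & mask) == expected)
			return 1;
		wait(addr);
	}
	return (*addr & mask) == expected;
}
```

With the no-op stub this loop degrades to plain polling, so the result is
the same and only the power saving is lost. That holds only for callers
that re-check the condition themselves, which is presumably where the
"used in a logic" concern comes from.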

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:22                                 ` Burakov, Anatoly
@ 2020-10-09 10:45                                   ` Jerin Jacob
  2020-10-09 10:48                                   ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09 10:45 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, Ananyev, Konstantin, David Marchand, Ma,
	Liang J, dpdk-dev, Hunt, David, Stephen Hemminger,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Fri, Oct 9, 2020 at 3:53 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > 09/10/2020 12:03, Burakov, Anatoly:
> >> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> >>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>>>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>>>
> >>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>>>
> >>>>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Jerin,
> >>>>>>>>>
> >>>>>>>>> Hi Anatoly,
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
> >>>>>>>>>> functions[1]
> >>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>>>      >
> >>>>>>>>>>      > This will help the consumer to create workers based on the
> >>>>>>>>>> instruction features
> >>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
> >>>>>>>>>> architectures.
> >>>>>>>>>>
> >>>>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>>>
> >>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>>>
> >>>>>>> That's works too. Some means of generic probing is fine. Following
> >>>>>>> schemed needs
> >>>>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>>>> new instructions like
> >>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>>>> supports it then
> >>>>>>> the feature-get framework is good.
> >>>>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>>>> framework then
> >>>>>>> we can keep the below scheme else generic arch feature is better for
> >>>>>>> more forward
> >>>>>>> looking use cases.
> >>>>>>>
> >>>>>>>> Might be just something like:
> >>>>>>>>      rte_power_monitor(...) == -ENOTSUP
> >>>>>>>> be enough indication for that?
> >>>>>>>> So user can just do:
> >>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>>>             /* not supported  path */
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> To check is that feature supported or not.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>>>> it's functionally identical to waking up immediately.
> >>>>>>
> >>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>>>
> >>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>>>
> >>>>
> >>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >>>> exist on other archs, this doesn't too, so it's a fairly similar
> >>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >>>> equivalent to sleeping and then immediately waking up (which can happen
> >>>> for a host of reasons unrelated to the code itself).
> >>>
> >>> If we are keeping the following return in the public API then it can not be NOP
> >>> + * @return
> >>> + *   - 1 if wakeup was due to TSC timeout expiration.
> >>> + *   - 0 if wakeup was due to memory write or other reasons.
> >>> + */
> >>>
> >>
> >> In the generic header, it is specified that return value is
> >> implementation-defined (i.e. arch-specific).
> >
> > Obviously an API definition should *never* be "implementation-defined".
>
> If there isn't a meaningful return value, we could either make it a
> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> result for a UMWAIT, and there are no side-effects to the intrinsic
> itself (it's basically a fancy rte_pause).
>
> >
> >
> >> I guess we could remove
> >> that and set return value to either 0 or -ENOTSUP if that would resolve
> >> the issue?
> >>
> >>> Also, we need to fix compilation issue if any with
> >>> http://patches.dpdk.org/patch/79540/
> >>> as it has direct reference to if
> >>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> >>> Either we need to add -ENOTSUP return or generic feature-get framework.
> >>
> >> IIRC power library isn't compiled on anything other than x86, so this
> >> code wouldn't get compiled.
> >
> > It is not call "power-x86", so we must assume it could work
> > on any architecture.
>
> #ifdef it is!
>
> >
> >
> >>>> I'm not against a generic feature-get framework, i'm just pointing out
> >>>> that if this is what's preventing the merge, it should prevent the merge
> >>>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >
> > CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> > that's why the expectations may be different.
> >
>
> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
> noop. So i don't think it's all that different.

If a platform does not support UMWAIT at all then, IMO, no consumer would take
this path for power saving. So IMO, it is different from CLDEMOTE.

>
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:22                                 ` Burakov, Anatoly
  2020-10-09 10:45                                   ` Jerin Jacob
@ 2020-10-09 10:48                                   ` Ananyev, Konstantin
  2020-10-09 11:12                                     ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 10:48 UTC (permalink / raw)
  To: Burakov, Anatoly, Thomas Monjalon, Jerin Jacob
  Cc: David Marchand, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > 09/10/2020 12:03, Burakov, Anatoly:
> >> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> >>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>>>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>>>
> >>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>>>
> >>>>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Jerin,
> >>>>>>>>>
> >>>>>>>>> Hi Anatoly,
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
> >>>>>>>>>> functions[1]
> >>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>>>      >
> >>>>>>>>>>      > This will help the consumer to create workers based on the
> >>>>>>>>>> instruction features
> >>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
> >>>>>>>>>> architectures.
> >>>>>>>>>>
> >>>>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>>>
> >>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>>>
> >>>>>>> That's works too. Some means of generic probing is fine. Following
> >>>>>>> schemed needs
> >>>>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>>>> new instructions like
> >>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>>>> supports it then
> >>>>>>> the feature-get framework is good.
> >>>>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>>>> framework then
> >>>>>>> we can keep the below scheme else generic arch feature is better for
> >>>>>>> more forward
> >>>>>>> looking use cases.
> >>>>>>>
> >>>>>>>> Might be just something like:
> >>>>>>>>      rte_power_monitor(...) == -ENOTSUP
> >>>>>>>> be enough indication for that?
> >>>>>>>> So user can just do:
> >>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>>>             /* not supported  path */
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> To check is that feature supported or not.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>>>> it's functionally identical to waking up immediately.
> >>>>>>
> >>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>>>
> >>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>>>
> >>>>
> >>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >>>> exist on other archs, this doesn't too, so it's a fairly similar
> >>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >>>> equivalent to sleeping and then immediately waking up (which can happen
> >>>> for a host of reasons unrelated to the code itself).
> >>>
> >>> If we are keeping the following return in the public API then it can not be NOP
> >>> + * @return
> >>> + *   - 1 if wakeup was due to TSC timeout expiration.
> >>> + *   - 0 if wakeup was due to memory write or other reasons.
> >>> + */
> >>>
> >>
> >> In the generic header, it is specified that return value is
> >> implementation-defined (i.e. arch-specific).
> >
> > Obviously an API definition should *never* be "implementation-defined".
> 
> If there isn't a meaningful return value, we could either make it a
> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> result for a UMWAIT, and there are no side-effects to the intrinsic
> itself (it's basically a fancy rte_pause).
> 
> >
> >
> >> I guess we could remove
> >> that and set return value to either 0 or -ENOTSUP if that would resolve
> >> the issue?
> >>
> >>> Also, we need to fix compilation issue if any with
> >>> http://patches.dpdk.org/patch/79540/
> >>> as it has direct reference to if
> >>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> >>> Either we need to add -ENOTSUP return or generic feature-get framework.
> >>
> >> IIRC power library isn't compiled on anything other than x86, so this
> >> code wouldn't get compiled.
> >
> > It is not call "power-x86", so we must assume it could work
> > on any architecture.
> 
> #ifdef it is!
> 
> >
> >
> >>>> I'm not against a generic feature-get framework, i'm just pointing out
> >>>> that if this is what's preventing the merge, it should prevent the merge
> >>>> of CLDEMOTE as well

I wouldn't consider these two as totally equal.
Yes, both are just hints to the CPU that can be ignored,
but if they are not ignored, the consequences of executing them are quite different.
If UMWAIT is not supported by the CPU at all, then the user might want to use some
different power saving mechanism (pause, frequency scaling, etc.).
Without knowing whether UMWAIT is supported or not, the user can't make
such a choice.
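
To illustrate the choice described above, here is a minimal sketch of how an
application might pick an idle method once it can query support. All names
(struct platform_caps, pick_idle_method, the enum values) are hypothetical
and are not part of the patchset; how the capability bits get filled in is
exactly the open question in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability summary; filled in by some platform probe. */
struct platform_caps {
	bool monitor_wait;  /* UMONITOR/UMWAIT-style wait is usable */
	bool freq_scaling;  /* core frequency scaling is usable */
};

/* Hypothetical idle methods, from most to least preferred. */
enum idle_method {
	IDLE_MONITOR_WAIT,
	IDLE_FREQ_SCALING,
	IDLE_PAUSE
};

/* Pick the most preferred method the platform reports as usable. */
static enum idle_method
pick_idle_method(const struct platform_caps *caps)
{
	assert(caps != NULL);

	if (caps->monitor_wait)
		return IDLE_MONITOR_WAIT;
	if (caps->freq_scaling)
		return IDLE_FREQ_SCALING;
	/* A plain pause loop is assumed to be available everywhere. */
	return IDLE_PAUSE;
}
```

If the intrinsic were silently a noop, monitor_wait could never be reported
as false, and the fallback branches would be unreachable.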

>, yet Jerin has acked that one and has explicitly
> >>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >
> > CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> > that's why the expectations may be different.
> >
> 
> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
> noop. So i don't think it's all that different.






* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:48                                   ` Ananyev, Konstantin
@ 2020-10-09 11:12                                     ` Burakov, Anatoly
  2020-10-09 11:36                                       ` Bruce Richardson
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 11:12 UTC (permalink / raw)
  To: Ananyev, Konstantin, Thomas Monjalon, Jerin Jacob
  Cc: David Marchand, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
>> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
>>> 09/10/2020 12:03, Burakov, Anatoly:
>>>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Jerin,
>>>>>>>>>>>
>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>       > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>>>       > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>>>> functions[1]
>>>>>>>>>>>>       > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>>>       > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>>>       >
>>>>>>>>>>>>       > This will help the consumer to create workers based on the
>>>>>>>>>>>> instruction features
>>>>>>>>>>>>       > which can NOT be abstracted as a generic feature across the
>>>>>>>>>>>> architectures.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>>>
>>>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>>>
>>>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>>>> schemed needs
>>>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>>>> new instructions like
>>>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>>>> supports it then
>>>>>>>>> the feature-get framework is good.
>>>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>>>> framework then
>>>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>>>> more forward
>>>>>>>>> looking use cases.
>>>>>>>>>
>>>>>>>>>> Might be just something like:
>>>>>>>>>>       rte_power_monitor(...) == -ENOTSUP
>>>>>>>>>> be enough indication for that?
>>>>>>>>>> So user can just do:
>>>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>>>              /* not supported  path */
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> To check is that feature supported or not.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>>>> it's functionally identical to waking up immediately.
>>>>>>>>
>>>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>>>
>>>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>>>
>>>>>>
>>>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>>>> for a host of reasons unrelated to the code itself).
>>>>>
>>>>> If we are keeping the following return in the public API then it can not be NOP
>>>>> + * @return
>>>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>>>> + */
>>>>>
>>>>
>>>> In the generic header, it is specified that return value is
>>>> implementation-defined (i.e. arch-specific).
>>>
>>> Obviously an API definition should *never* be "implementation-defined".
>>
>> If there isn't a meaningful return value, we could either make it a
>> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
>> result for a UMWAIT, and there are no side-effects to the intrinsic
>> itself (it's basically a fancy rte_pause).
>>
>>>
>>>
>>>> I guess we could remove
>>>> that and set return value to either 0 or -ENOTSUP if that would resolve
>>>> the issue?
>>>>
>>>>> Also, we need to fix compilation issue if any with
>>>>> http://patches.dpdk.org/patch/79540/
>>>>> as it has direct reference to if
>>>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>>>
>>>> IIRC power library isn't compiled on anything other than x86, so this
>>>> code wouldn't get compiled.
>>>
>>> It is not call "power-x86", so we must assume it could work
>>> on any architecture.
>>
>> #ifdef it is!
>>
>>>
>>>
>>>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>>>> that if this is what's preventing the merge, it should prevent the merge
>>>>>> of CLDEMOTE as well
> 
> I wouldn't consider these two as totally equal.
> Yes, both are just hints to CPU, that can be ignored,
> but if not, then consequences of executing are quite different.
> If UMWAIT is not supported by cpu at all, then user might want to use some
> different power saving mechanism (pause, frequence scaling, etc.).
> Without information is UMWAIT supported or not, user can't make
> such choice.

After some attempts at implementing this, I actually came to the
conclusion that some generic way to check support for this feature is
necessary, because otherwise we end up with a usability inconsistency:

1) on non-x86, if you call the function, it returns -ENOTSUP
2) on x86, since we're not checking CPUID flags on every single call,
it'll either succeed or crash the process. The burden is on the user
to check the CPUID flags, but that can't be done in an arch-agnostic way
because the CPUID flags are only defined for x86, thus requiring a
special code path for x86

Where would be the best place to add such infrastructure? I'm
thinking rte_cpuflags.h?
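
A rough sketch of what such a check could look like, to make the idea
concrete. Every name here (struct cpu_intrinsics, cpu_get_intrinsics_support,
power_monitor_checked, arch_probe_power_monitor) is made up for illustration
and the probe is a stub; on x86 it would test the WAITPKG flag once, on
other architectures it would simply report false.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arch-agnostic description of supported intrinsics. */
struct cpu_intrinsics {
	bool power_monitor; /* monitor-and-wait intrinsic is usable */
};

/* Stub for the per-arch probe; this sketch always reports "unsupported". */
static bool
arch_probe_power_monitor(void)
{
	return false;
}

/* Generic query: same signature on every architecture. */
static void
cpu_get_intrinsics_support(struct cpu_intrinsics *caps)
{
	assert(caps != NULL);
	caps->power_monitor = arch_probe_power_monitor();
}

/*
 * Caller-side wrapper: with the capability probed once up front, an
 * unsupported platform gets -ENOTSUP instead of a crash, on any arch.
 */
static int
power_monitor_checked(const struct cpu_intrinsics *caps,
		const volatile void *addr, uint64_t tsc_deadline)
{
	assert(caps != NULL);
	if (!caps->power_monitor)
		return -ENOTSUP;
	/* The arch-specific monitor-and-wait would go here. */
	(void)addr;
	(void)tsc_deadline;
	return 0;
}
```

The point of the split is that the per-call path only tests a cached
boolean, so there is no CPUID query on every call and no x86-only flag name
in the caller.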

> 
>> , yet Jerin has acked that one and has explicitly
>>>>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>>>
>>> CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
>>> that's why the expectations may be different.
>>>
>>
>> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
>> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
>> noop. So i don't think it's all that different.
> 
> 
> 
> 


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 11:12                                     ` Burakov, Anatoly
@ 2020-10-09 11:36                                       ` Bruce Richardson
  2020-10-09 11:42                                         ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Bruce Richardson @ 2020-10-09 11:36 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Ananyev, Konstantin, Thomas Monjalon, Jerin Jacob,
	David Marchand, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Fri, Oct 09, 2020 at 12:12:56PM +0100, Burakov, Anatoly wrote:
> On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
> > > On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > > > 09/10/2020 12:03, Burakov, Anatoly:
> > > > > On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > > > > > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> > > > > > > > 09/10/2020 11:25, Burakov, Anatoly:
> > > > > > > > > On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> > > > > > > > > > On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> > > > > > > > > > <konstantin.ananyev@intel.com> wrote:
> > > > > > > > > > > > On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> > > > > > > > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > > > > > > > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > > > > > > > > > > > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Add two new power management intrinsics, and provide an implementation
> > > > > > > > > > > > > > > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > > > > > > > > > > > > > > are implemented as raw byte opcodes because there is not yet widespread
> > > > > > > > > > > > > > > > compiler support for these instructions.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The power management instructions provide an architecture-specific
> > > > > > > > > > > > > > > > function to either wait until a specified TSC timestamp is reached, or
> > > > > > > > > > > > > > > > optionally wait until either a TSC timestamp is reached or a memory
> > > > > > > > > > > > > > > > location is written to. The monitor function also provides an optional
> > > > > > > > > > > > > > > > comparison, to avoid sleeping when the expected write has already
> > > > > > > > > > > > > > > > happened, and no more writes are expected.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > For more details, Please reference Intel SDM Volume 2.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I really would like to see feedbacks from other arch maintainers.
> > > > > > > > > > > > > > > Unfortunately they were not Cc'ed.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > > > > > > > > > > > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Also please mark the new functions as experimental.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hi Jerin,
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Anatoly,
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > >       > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> > > > > > > > > > > > >       > the consumer of this API is only supported on x86. Probably as
> > > > > > > > > > > > > functions[1]
> > > > > > > > > > > > >       > or macro flags scheme and have a stub for the other architectures as the
> > > > > > > > > > > > >       > API marked as generic ie rte_power_* not rte_x86_..
> > > > > > > > > > > > >       >
> > > > > > > > > > > > >       > This will help the consumer to create workers based on the
> > > > > > > > > > > > > instruction features
> > > > > > > > > > > > >       > which can NOT be abstracted as a generic feature across the
> > > > > > > > > > > > > architectures.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I'm not entirely sure what you mean by that.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I mean, yes, we should have added stubs for other architectures, and we
> > > > > > > > > > > > > will add those in future revisions, but what does your proposed runtime
> > > > > > > > > > > > > check accomplish that cannot currently be done with CPUID flags?
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> > > > > > > > > > > > i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> > > > > > > > > > > > and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> > > > > > > > > > > > I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > I am agree with Jerin, that we need some generic way to
> > > > > > > > > > > figure-out does platform supports power_monitor() or not.
> > > > > > > > > > > Though not sure do we need to create a new feature-get framework here...
> > > > > > > > > > 
> > > > > > > > > > That's works too. Some means of generic probing is fine. Following
> > > > > > > > > > schemed needs
> > > > > > > > > > more documentation on that usage, as, it is not straight forward compare to
> > > > > > > > > > feature-get framework. Also, on the other thread, we are adding the
> > > > > > > > > > new instructions like
> > > > > > > > > > demote cacheline etc, maybe if the user wants to KNOW if the arch
> > > > > > > > > > supports it then
> > > > > > > > > > the feature-get framework is good.
> > > > > > > > > > If we think, there is no other usecase for generic arch feature-get
> > > > > > > > > > framework then
> > > > > > > > > > we can keep the below scheme else generic arch feature is better for
> > > > > > > > > > more forward
> > > > > > > > > > looking use cases.
> > > > > > > > > > 
> > > > > > > > > > > Might be just something like:
> > > > > > > > > > >       rte_power_monitor(...) == -ENOTSUP
> > > > > > > > > > > be enough indication for that?
> > > > > > > > > > > So user can just do:
> > > > > > > > > > > if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> > > > > > > > > > >              /* not supported  path */
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > To check is that feature supported or not.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> > > > > > > > > we can safely make this intrinsic as a noop on other archs as well, as
> > > > > > > > > it's functionally identical to waking up immediately.
> > > > > > > > > 
> > > > > > > > > If we're not creating this for CLDEMOTE, we don't need it here as well.
> > > > > > > > > If we do need it for this, then we arguably need it for CLDEMOTE too.
> > > > > > > > 
> > > > > > > > Sorry I don't understand what you mean, too many "it" and "this" :)
> > > > > > > > 
> > > > > > > 
> > > > > > > Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> > > > > > > exist on other archs, this doesn't too, so it's a fairly similar
> > > > > > > situation. Stubbing UMWAIT with a noop is a valid approach because it's
> > > > > > > equivalent to sleeping and then immediately waking up (which can happen
> > > > > > > for a host of reasons unrelated to the code itself).
> > > > > > 
> > > > > > If we are keeping the following return in the public API then it can not be NOP
> > > > > > + * @return
> > > > > > + *   - 1 if wakeup was due to TSC timeout expiration.
> > > > > > + *   - 0 if wakeup was due to memory write or other reasons.
> > > > > > + */
> > > > > > 
> > > > > 
> > > > > In the generic header, it is specified that return value is
> > > > > implementation-defined (i.e. arch-specific).
> > > > 
> > > > Obviously an API definition should *never* be "implementation-defined".
> > > 
> > > If there isn't a meaningful return value, we could either make it a
> > > void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> > > result for a UMWAIT, and there are no side-effects to the intrinsic
> > > itself (it's basically a fancy rte_pause).
> > > 
> > > > 
> > > > 
> > > > > I guess we could remove
> > > > > that and set return value to either 0 or -ENOTSUP if that would resolve
> > > > > the issue?
> > > > > 
> > > > > > Also, we need to fix compilation issue if any with
> > > > > > http://patches.dpdk.org/patch/79540/
> > > > > > as it has direct reference to if
> > > > > > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > > > > > Either we need to add -ENOTSUP return or generic feature-get framework.
> > > > > 
> > > > > IIRC power library isn't compiled on anything other than x86, so this
> > > > > code wouldn't get compiled.
> > > > 
> > > > It is not call "power-x86", so we must assume it could work
> > > > on any architecture.
> > > 
> > > #ifdef it is!
> > > 
> > > > 
> > > > 
> > > > > > > I'm not against a generic feature-get framework, i'm just pointing out
> > > > > > > that if this is what's preventing the merge, it should prevent the merge
> > > > > > > of CLDEMOTE as well
> > 
> > I wouldn't consider these two as totally equal.
> > Yes, both are just hints to CPU, that can be ignored,
> > but if not, then consequences of executing are quite different.
> > If UMWAIT is not supported by cpu at all, then user might want to use some
> > different power saving mechanism (pause, frequence scaling, etc.).
> > Without information is UMWAIT supported or not, user can't make
> > such choice.
> 
> After some attempts at implementing this, i actually came to the conclusion
> that some generic way to check support for this feature is necessary,
> because we end up with a usability inconsistency:
> 
> 1) on non-x86, if you call the function, it returns -ENOTSUP
> 2) on x86, since we're not checking CPUID flags on every single call, it'll
> either succeed, or crash the process - the burden is on the user to check
> for CPUID flags, but it can't be done in an arch agnostic way because the
> CPUID flags are only defined for x86, thus requiring a special code path for
> x86
> 
> Where would be the best place to add such an infrastructure? I'm thinking
> rte_cpuflags.h?
> 
Time to take another look at some of the contents of this patchset:
http://patches.dpdk.org/project/dpdk/list/?series=4811&archive=both&state=*

The idea of that set (IIRC) was to replace the per-architecture enums with
plain strings to avoid situations like this - or at least make them less
awkward.
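
As a rough illustration of the string-based idea (not the code from that
patchset; the table contents and the function name cpu_flag_enabled_by_name
are hypothetical), a lookup by name lets generic code ask about a flag that
does not exist on the current architecture without a compile error:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-arch table: flag name and whether this CPU has it. */
struct cpu_flag {
	const char *name;
	bool enabled;
};

/* Example contents only; a real table would be filled in at startup. */
static const struct cpu_flag cpu_flags[] = {
	{ "SSE4_2", true },
	{ "WAITPKG", false },
};

/*
 * Return 1 if the named flag is known and enabled, 0 otherwise
 * (known but disabled, or not defined for this architecture at all).
 */
static int
cpu_flag_enabled_by_name(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(cpu_flags) / sizeof(cpu_flags[0]); i++) {
		if (strcmp(cpu_flags[i].name, name) == 0)
			return cpu_flags[i].enabled ? 1 : 0;
	}
	return 0; /* unknown name: treat as unsupported */
}
```

With something like this, a check for "WAITPKG" compiles on every
architecture and simply returns 0 where the flag is not defined.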

/Bruce


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 11:36                                       ` Bruce Richardson
@ 2020-10-09 11:42                                         ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 11:42 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Ananyev, Konstantin, Thomas Monjalon, Jerin Jacob,
	David Marchand, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 12:36 PM, Bruce Richardson wrote:
> On Fri, Oct 09, 2020 at 12:12:56PM +0100, Burakov, Anatoly wrote:
>> On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
>>>> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
>>>>> 09/10/2020 12:03, Burakov, Anatoly:
>>>>>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>>>>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Jerin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>>>>>        > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>>>>>> functions[1]
>>>>>>>>>>>>>>        > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>>>>>        > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>>>>>        >
>>>>>>>>>>>>>>        > This will help the consumer to create workers based on the
>>>>>>>>>>>>>> instruction features
>>>>>>>>>>>>>>        > which can NOT be abstracted as a generic feature across the
>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>>>>>
>>>>>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>>>>>> schemed needs
>>>>>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>>>>>> new instructions like
>>>>>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>>>>>> supports it then
>>>>>>>>>>> the feature-get framework is good.
>>>>>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>>>>>> framework then
>>>>>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>>>>>> more forward
>>>>>>>>>>> looking use cases.
>>>>>>>>>>>
>>>>>>>>>>>> Might be just something like:
>>>>>>>>>>>>        rte_power_monitor(...) == -ENOTSUP
>>>>>>>>>>>> be enough indication for that?
>>>>>>>>>>>> So user can just do:
>>>>>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>>>>>               /* not supported  path */
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> To check is that feature supported or not.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>>>>>> it's functionally identical to waking up immediately.
>>>>>>>>>>
>>>>>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>>>>>
>>>>>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>>>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>>>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>>>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>>>>>> for a host of reasons unrelated to the code itself).
>>>>>>>
>>>>>>> If we are keeping the following return in the public API then it can not be NOP
>>>>>>> + * @return
>>>>>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>>>>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>>>>>> + */
>>>>>>>
>>>>>>
>>>>>> In the generic header, it is specified that return value is
>>>>>> implementation-defined (i.e. arch-specific).
>>>>>
>>>>> Obviously an API definition should *never* be "implementation-defined".
>>>>
>>>> If there isn't a meaningful return value, we could either make it a
>>>> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
>>>> result for a UMWAIT, and there are no side-effects to the intrinsic
>>>> itself (it's basically a fancy rte_pause).
>>>>
>>>>>
>>>>>
>>>>>> I guess we could remove
>>>>>> that and set return value to either 0 or -ENOTSUP if that would resolve
>>>>>> the issue?
>>>>>>
>>>>>>> Also, we need to fix compilation issue if any with
>>>>>>> http://patches.dpdk.org/patch/79540/
>>>>>>> as it has direct reference to if
>>>>>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>>>>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>>>>>
>>>>>> IIRC power library isn't compiled on anything other than x86, so this
>>>>>> code wouldn't get compiled.
>>>>>
>>>>> It is not call "power-x86", so we must assume it could work
>>>>> on any architecture.
>>>>
>>>> #ifdef it is!
>>>>
>>>>>
>>>>>
>>>>>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>>>>>> that if this is what's preventing the merge, it should prevent the merge
>>>>>>>> of CLDEMOTE as well
>>>
>>> I wouldn't consider these two as totally equal.
>>> Yes, both are just hints to CPU, that can be ignored,
>>> but if not, then consequences of executing are quite different.
>>> If UMWAIT is not supported by cpu at all, then user might want to use some
>>> different power saving mechanism (pause, frequence scaling, etc.).
>>> Without information is UMWAIT supported or not, user can't make
>>> such choice.
>>
>> After some attempts at implementing this, i actually came to the conclusion
>> that some generic way to check support for this feature is necessary,
>> because we end up with a usability inconsistency:
>>
>> 1) on non-x86, if you call the function, it returns -ENOTSUP
>> 2) on x86, since we're not checking CPUID flags on every single call, it'll
>> either succeed, or crash the process - the burden is on the user to check
>> for CPUID flags, but it can't be done in an arch agnostic way because the
>> CPUID flags are only defined for x86, thus requiring a special code path for
>> x86
>>
>> Where would be the best place to add such an infrastructure? I'm thinking
>> rte_cpuflags.h?
>>
> Time to relook at some of the contents of patchset:
> http://patches.dpdk.org/project/dpdk/list/?series=4811&archive=both&state=*
> 
> The idea of that set (IIRC) was to replace the per-architecture enums with
> just strings to avoid situations like this - or at least make them less
> awkward.
> 
> /Bruce
> 

Yes, that patchset looks like it would work nicely in this case. If 
there is consensus to resurrect it and make it work, i'll drop the 
"generic feature get" thing, but for now i'll continue working on it - 
easier to remove code than to add it :D

-- 
Thanks,
Anatoly
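
The inconsistency discussed in this thread (-ENOTSUP on non-x86, a possible crash on x86) goes away if every architecture answers the same arch-neutral query. A minimal sketch of that idea follows; all names in it (`struct intrinsics_support`, `x86_get_support()`, `stub_get_support()`, `choose_wait_strategy()`) are hypothetical and not taken from the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the API from the patchset. */
struct intrinsics_support {
	bool power_monitor;	/* monitor-and-wait intrinsic is usable */
	bool power_pause;	/* timed-pause intrinsic is usable */
};

enum wait_strategy {
	WAIT_POWER_MONITOR,	/* sleep until the descriptor is written */
	WAIT_BUSY_POLL		/* fall back to pause or frequency scaling */
};

/* x86 flavour: both intrinsics depend on the WAITPKG CPUID bit. */
static void x86_get_support(struct intrinsics_support *out, bool has_waitpkg)
{
	assert(out != NULL);
	out->power_monitor = has_waitpkg;
	out->power_pause = has_waitpkg;
}

/* Flavour for architectures with no equivalent instruction. */
static void stub_get_support(struct intrinsics_support *out)
{
	assert(out != NULL);
	out->power_monitor = false;
	out->power_pause = false;
}

/* Application code: no CPUID names, so it builds on every architecture. */
static enum wait_strategy choose_wait_strategy(const struct intrinsics_support *s)
{
	assert(s != NULL);
	return s->power_monitor ? WAIT_POWER_MONITOR : WAIT_BUSY_POLL;
}
```

The caller checks once at startup and picks a fallback itself, which is the choice Konstantin asks for above.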


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:11           ` Burakov, Anatoly
@ 2020-10-09 15:39             ` Ananyev, Konstantin
  2020-10-09 16:10               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 15:39 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen


> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
> >>
> >> Add two new power management intrinsics, and provide an implementation
> >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >> are implemented as raw byte opcodes because there is not yet widespread
> >> compiler support for these instructions.
> >>
> >> The power management instructions provide an architecture-specific
> >> function to either wait until a specified TSC timestamp is reached, or
> >> optionally wait until either a TSC timestamp is reached or a memory
> >> location is written to. The monitor function also provides an optional
> >> comparison, to avoid sleeping when the expected write has already
> >> happened, and no more writes are expected.
> >
> > I think what this API is missing - a function to wakeup sleeping core.
> > If user can/should use some system call to achieve that, then at least
> > it has to be clearly documented, even better some wrapper provided.
> 
> I don't think it's possible to do that without severely overcomplicating
> the intrinsic and its usage, because AFAIK the only way to wake up a
> sleeping core would be to send some kind of interrupt to the core, or
> trigger a write to the cache-line in question.
> 

Yes, I think we either need a syscall that would do an IPI for us
(off the top of my head, membarrier() does that; there might be other syscalls too),
or something hand-made. For hand-made, I wonder whether something like this
would be safe and sufficient:
uint64_t val = atomic_load(addr);
CAS(addr, val, &val);
?
Anyway, one way or another, I think the ability to wake up a core we put to sleep
has to be an essential part of this feature.
As I understand it, the Linux kernel limits the maximum sleep time for these instructions:
https://lwn.net/Articles/790920/
But relying just on that seems too vague to me:
- the user can adjust that value
- it wouldn't apply to older kernels and non-Linux cases
Konstantin
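
The hand-made variant above can be written out with C11 atomics as follows. This is a minimal sketch of the software side only: `wake_by_rewrite()` is a hypothetical helper, and whether a compare-and-swap that stores an unchanged value is treated as a write by the monitoring hardware is exactly the open question in this mail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patchset): touch a monitored
 * location from another thread without changing its value.
 */
static inline void wake_by_rewrite(_Atomic uint64_t *addr)
{
	assert(addr != NULL);

	uint64_t val = atomic_load_explicit(addr, memory_order_relaxed);

	/*
	 * Try to replace `val` with `val`. If another writer got there
	 * first the exchange fails, but then that writer has already
	 * stored to the line, so no retry is attempted.
	 */
	(void)atomic_compare_exchange_strong(addr, &val, val);
}
```

The value is left untouched from the point of view of any other reader, so the helper would not disturb a driver that polls the same word.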


* Re: [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API Liang Ma
@ 2020-10-09 15:53         ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 15:53 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1



* Re: [dpdk-dev] [PATCH v4 06/10] net/i40e: implement power management API
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 06/10] net/i40e: " Liang Ma
@ 2020-10-09 16:01         ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 16:01 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1



* [dpdk-dev] [PATCH v5 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (10 preceding siblings ...)
  2020-10-08 22:08       ` Ananyev, Konstantin
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 17:06         ` Burakov, Anatoly
                           ` (10 more replies)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
                         ` (8 subsequent siblings)
  20 siblings, 11 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Bruce Richardson, Konstantin Ananyev, david.hunt,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add new x86 cpuid support for WAITPKG.
This flag indicates that the processor supports the UMWAIT/UMONITOR/TPAUSE
instructions.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/* WAITPKG power management instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1
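
For readers without a DPDK tree at hand, the `FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)` entry above encodes "CPUID leaf 7, subleaf 0, ECX bit 5". A standalone sketch of the same check follows, assuming a GCC or Clang toolchain that provides `__get_cpuid_count()` in `<cpuid.h>`; `reg_bit_set()` and `cpu_has_waitpkg()` are hypothetical names, and DPDK code would use `rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)` instead.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define WAITPKG_LEAF    7u	/* CPUID leaf, as in the FEAT_DEF entry */
#define WAITPKG_SUBLEAF 0u	/* CPUID subleaf */
#define WAITPKG_ECX_BIT 5u	/* bit position within ECX */

/* Test one bit of a 32-bit CPUID output register. */
static bool reg_bit_set(unsigned int reg, unsigned int bit)
{
	assert(bit < 32);	/* shifting by 32 or more would be undefined */
	return (reg >> bit) & 1u;
}

/* Hypothetical standalone check; always false on non-x86 builds. */
static bool cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 when the leaf is not available */
	if (!__get_cpuid_count(WAITPKG_LEAF, WAITPKG_SUBLEAF,
			&eax, &ebx, &ecx, &edx))
		return false;
	return reg_bit_set(ecx, WAITPKG_ECX_BIT);
#else
	return false;
#endif
}
```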


* [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (11 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 16:09         ` Jerin Jacob
  2020-10-12 19:47         ` David Christensen
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
                         ` (7 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jan Viktorin, Ruifeng Wang, David Christensen,
	Bruce Richardson, Konstantin Ananyev, david.hunt, jerinjacobk,
	thomas, timothy.mcdaniel, gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

For more details, please refer to Intel(R) 64 and IA-32 Architectures
Software Developer's Manual, Volume 2.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Removed return values
    - Simplified intrinsics and hardcoded C0.2 state
    - Added other arch stubs

 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  62 ++++++++++
 .../include/generic/rte_power_intrinsics.h    |  61 ++++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  62 ++++++++++
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 106 ++++++++++++++++++
 8 files changed, 295 insertions(+)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
index 73b750a18f..c6a9f70d73 100644
--- a/lib/librte_eal/arm/include/meson.build
+++ b/lib/librte_eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
 	'rte_pause_32.h',
 	'rte_pause_64.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch_32.h',
 	'rte_prefetch_64.h',
 	'rte_prefetch.h',
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..4aad44a0b9
--- /dev/null
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_ARM_H_
+#define _RTE_POWER_INTRINSIC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @note
+ *   No value is returned; this stub returns immediately.
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+}
+
+/**
+ * This function is not supported on ARM.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @note
+ *   This function returns no value. On this architecture it is a stub
+ *   that returns immediately.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..e36c1f8976
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @note
+ *   This function returns no value; callers cannot tell from it why the
+ *   wait ended or whether the intrinsic is supported.
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @note
+ *   No value is returned; the wakeup reason is not reported.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
index ab4bd28092..0873b2aecb 100644
--- a/lib/librte_eal/ppc/include/meson.build
+++ b/lib/librte_eal/ppc/include/meson.build
@@ -10,6 +10,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch.h',
 	'rte_rwlock.h',
 	'rte_spinlock.h',
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..70fd7b094f
--- /dev/null
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_PPC_H_
+#define _RTE_POWER_INTRINSIC_PPC_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @note
+ *   No value is returned; this stub returns immediately.
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+}
+
+/**
+ * This function is not supported on PPC64.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @note
+ *   This function returns no value. On this architecture it is a stub
+ *   that returns immediately.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..8d579eaf64
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @note
+ *   No value is returned, so the caller cannot tell why the wait ended.
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses the TPAUSE instruction and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @note
+ *   No value is returned, so the caller cannot tell whether the wakeup was
+ *   due to TSC timeout expiration or to other reasons.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+		     : /* ignore rflags */
+		     : "D"(0), /* enter C0.2 */
+		       "a"(tsc_l), "d"(tsc_h));
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
-- 
2.17.1
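
Two pieces of logic in the x86 `rte_power_monitor()` above are independent of the raw opcodes: splitting the 64-bit TSC deadline into the EDX:EAX halves, and the masked comparison that skips the sleep when the expected write has already landed. A portable sketch of just those two steps follows; `split_tsc()` and `write_already_seen()` are hypothetical helper names that do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers mirroring the logic in the x86 rte_power_monitor(). */

/* Split a 64-bit TSC deadline into the EDX:EAX halves UMWAIT/TPAUSE expect. */
static inline void split_tsc(uint64_t tsc, uint32_t *lo, uint32_t *hi)
{
	assert(lo != NULL && hi != NULL);
	*lo = (uint32_t)tsc;
	*hi = (uint32_t)(tsc >> 32);
}

/* True if the sleep should be skipped because the expected value is present. */
static inline bool write_already_seen(const volatile uint64_t *p,
		uint64_t expected_value, uint64_t value_mask)
{
	assert(p != NULL);
	if (value_mask == 0)
		return false;	/* a zero mask disables the check */
	return (*p & value_mask) == expected_value;
}
```

In the patch the comparison runs after UMONITOR has armed the address and before UMWAIT, so a write that lands between the two steps is still caught by the monitoring hardware.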


* [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (12 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-11 10:07         ` Jerin Jacob
  2020-10-12 19:52         ` David Christensen
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API Anatoly Burakov
                         ` (6 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Ray Kinsella,
	Neil Horman, Bruce Richardson, Konstantin Ananyev, david.hunt,
	liang.j.ma, jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

Currently, it is not possible to check support for intrinsics that
are platform-specific, cannot be abstracted in a generic way, or do not
have support on all architectures. The CPUID flags can be used to some
extent, but they are only defined for their platform, while intrinsics
will be available to all code as they are in generic headers.

This patch introduces infrastructure to check for support of certain
platform-specific intrinsics, and adds checks for the IA power
management-related intrinsics (UMONITOR/UMWAIT and TPAUSE).
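
The intended calling pattern can be sketched roughly as below. This is a
minimal, self-contained illustration and not the patch code: the struct only
mirrors `struct rte_cpu_intrinsics`, and `get_intrinsics_support()`,
`can_use_power_monitor()` and the `has_waitpkg` parameter are hypothetical
stand-ins for `rte_cpu_get_intrinsics_support()` and the
`rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)` query.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in mirroring struct rte_cpu_intrinsics from this patch. */
struct cpu_intrinsics {
	uint32_t power_monitor : 1; /* rte_power_monitor() usable */
	uint32_t power_pause : 1;   /* rte_power_pause() usable */
};

/*
 * Hypothetical stub of the x86 implementation: both bits are set only
 * when WAITPKG is present. has_waitpkg replaces the real CPUID query.
 */
static void
get_intrinsics_support(struct cpu_intrinsics *intr, int has_waitpkg)
{
	memset(intr, 0, sizeof(*intr));
	if (has_waitpkg) {
		intr->power_monitor = 1;
		intr->power_pause = 1;
	}
}

/* Returns 1 if the caller may use the monitor intrinsic, 0 otherwise. */
static int
can_use_power_monitor(int has_waitpkg)
{
	struct cpu_intrinsics intr;

	get_intrinsics_support(&intr, has_waitpkg);
	return intr.power_monitor ? 1 : 0;
}
```

A caller would run this check once at setup time and fall back to ordinary
polling when it returns 0, rather than risk an illegal instruction.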

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../arm/include/rte_power_intrinsics.h        |  8 ++++++
 lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
 .../include/generic/rte_power_intrinsics.h    |  8 ++++++
 .../ppc/include/rte_power_intrinsics.h        |  8 ++++++
 lib/librte_eal/ppc/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/rte_eal_version.map            |  1 +
 .../x86/include/rte_power_intrinsics.h        |  8 ++++++
 lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
 9 files changed, 83 insertions(+)

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 4aad44a0b9..055ec5877a 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -17,6 +17,10 @@ extern "C" {
 /**
  * This function is not supported on ARM.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
  * @param expected_value
@@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
 /**
  * This function is not supported on ARM.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for.
  *
diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
index caf3dc83a5..7eef11fa02 100644
--- a/lib/librte_eal/arm/rte_cpuflags.c
+++ b/lib/librte_eal/arm/rte_cpuflags.c
@@ -138,3 +138,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
index 872f0ebe3e..28a5aecde8 100644
--- a/lib/librte_eal/include/generic/rte_cpuflags.h
+++ b/lib/librte_eal/include/generic/rte_cpuflags.h
@@ -13,6 +13,32 @@
 #include "rte_common.h"
 #include <errno.h>
 
+#include <rte_compat.h>
+
+/**
+ * Structure used to describe platform-specific intrinsics that may or may not
+ * be supported at runtime.
+ */
+struct rte_cpu_intrinsics {
+	uint32_t power_monitor : 1;
+	/**< indicates support for rte_power_monitor function */
+	uint32_t power_pause : 1;
+	/**< indicates support for rte_power_pause function */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Check CPU support for various intrinsics at runtime.
+ *
+ * @param intrinsics
+ *     Pointer to a structure to be filled.
+ */
+__rte_experimental
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
+
 /**
  * Enumeration of all CPU features supported
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index e36c1f8976..218eda7e86 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -26,6 +26,10 @@
  * checked against the expected value, and if they match, the entering of
  * optimized power state may be aborted.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
  * @param expected_value
@@ -49,6 +53,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 70fd7b094f..d63ad86849 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -17,6 +17,10 @@ extern "C" {
 /**
  * This function is not supported on PPC64.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
  * @param expected_value
@@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
 /**
  * This function is not supported on PPC64.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for.
  *
diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
index 3bb7563ce9..eee8234384 100644
--- a/lib/librte_eal/ppc/rte_cpuflags.c
+++ b/lib/librte_eal/ppc/rte_cpuflags.c
@@ -108,3 +108,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index a93dea9fe6..ed944f2bd4 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -400,6 +400,7 @@ EXPERIMENTAL {
 	# added in 20.11
 	__rte_eal_trace_generic_size_t;
 	rte_service_lcore_may_be_active;
+	rte_cpu_get_intrinsics_support;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
index 8d579eaf64..3afc165a1f 100644
--- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -29,6 +29,10 @@ extern "C" {
  * For more information about usage of these instructions, please refer to
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
  * @param expected_value
@@ -80,6 +84,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * information about usage of this instruction, please refer to Intel(R) 64 and
  * IA-32 Architectures Software Developer's Manual.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for.
  *
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 0325c4b93b..a96312ff7f 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -7,6 +7,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <stdint.h>
+#include <string.h>
 
 #include "rte_cpuid.h"
 
@@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+		intrinsics->power_monitor = 1;
+		intrinsics->power_pause = 1;
+	}
+}
-- 
2.17.1


* [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (13 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-14  3:10         ` Guo, Jia
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 05/10] power: add PMD power management API and callback Anatoly Burakov
                         ` (5 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to get the address of the next RX descriptor from the
PMD, and update the release notes accordingly.
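
The dispatch this API performs (return -ENOTSUP when the driver does not
provide the callback, otherwise forward to it) can be sketched as follows.
This is a simplified illustration under stated assumptions: `struct dev_ops`,
`get_wake_addr()` and `toy_get_wake_addr()` are hypothetical stand-ins, not
the ethdev structures or any real driver.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as eth_get_wake_addr_t in this patch. */
typedef int (*get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask);

/* Hypothetical, reduced stand-in for struct eth_dev_ops. */
struct dev_ops {
	get_wake_addr_t get_wake_addr;
};

/* Forward to the driver callback, or report that it is unsupported. */
static int
get_wake_addr(const struct dev_ops *ops, void *rxq,
		volatile void **wake_addr, uint64_t *expected, uint64_t *mask)
{
	if (ops->get_wake_addr == NULL)
		return -ENOTSUP;
	return ops->get_wake_addr(rxq, wake_addr, expected, mask);
}

/* Toy driver callback: monitors the "queue" itself, expects bit 0 set. */
static int
toy_get_wake_addr(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask)
{
	*tail_desc_addr = rxq;
	*expected = 1;
	*mask = 1;
	return 0;
}
```

Returning -ENOTSUP for a missing callback lets a caller distinguish "this PMD
cannot do it" from a failure inside the driver.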

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Bring function format in line with other functions in the file
    - Ensure the API is supported by the driver before calling it (Konstantin)

 doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
 lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_version.map |  1 +
 5 files changed, 86 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 808bdc4e54..e85af5d3e9 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()``
+  * add new eth_dev_ops ``get_wake_addr``
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
@@ -136,6 +141,17 @@ New Features
   * Extern objects and functions can be plugged into the pipeline.
   * Transaction-oriented table updates.
 
+* **Add PMD power management mechanism**
+
+  Three new Ethernet PMD power management mechanisms are added through the
+  existing RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 48d1333b17..352108f43c 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -4804,6 +4804,23 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d2bf74f128..a6cfe3cd57 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4014,6 +4014,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake-up address for a specific Rx queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which the wake-up address
+ *   will be retrieved.
+ * @param wake_addr
+ *   Pointer to a variable that receives the address to monitor.
+ * @param expected
+ *   Pointer to the value expected at that address once the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ *
+ * @return
+ *   - 0: Success.
+ *   - <0: Error code (-ENODEV, -ENOTSUP, or a driver-specific error).
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+			  volatile void **wake_addr,
+			  uint64_t *expected, uint64_t *mask);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c3062c246c..935d46f25c 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the wake-up address.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to a variable that receives the descriptor address.
+ * @param expected
+ *   Pointer to the value expected at that address once the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_get_wake_addr_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -713,6 +738,9 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get wake up address. */
+
 };
 
 /**
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index c95ef5157a..3cb2093980 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -229,6 +229,7 @@ EXPERIMENTAL {
 	# added in 20.11
 	rte_eth_link_speed_to_str;
 	rte_eth_link_to_str;
+	rte_eth_get_wake_addr;
 };
 
 INTERNAL {
-- 
2.17.1


* [dpdk-dev] [PATCH v5 05/10] power: add PMD power management API and callback
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (14 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
                         ` (4 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman,
	konstantin.ananyev, jerinjacobk, bruce.richardson, thomas,
	timothy.mcdaniel, gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait until
either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design uses PMD RX callbacks and offers three schemes:

1. UMWAIT/UMONITOR

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting for the address of the next
   RX descriptor to be written to.

2. Pause instruction

   Instead of moving the core into a deeper C-state, this method uses the
   pause instruction to avoid busy polling.

3. Frequency scaling

   Reuse the existing DPDK power library to scale the core frequency up or
   down depending on traffic volume.
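
The empty-poll accounting shared by all three schemes can be sketched as
below. This is an illustration only: `struct queue_state` and
`on_rx_burst()` are hypothetical names, and the threshold simply reuses the
EMPTYPOLL_MAX value from this patch. The function reports when the
power-saving action would be taken instead of performing it.

```c
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold as in this patch */

/* Hypothetical per-queue state; the patch keeps it in pmd_queue_cfg. */
struct queue_state {
	uint16_t empty_polls;
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns 1 when the caller should take the power-saving action
 * (sleep, pause, or scale down), 0 otherwise.
 */
static int
on_rx_burst(struct queue_state *q, uint16_t nb_rx)
{
	if (nb_rx == 0) {
		q->empty_polls++;
		return q->empty_polls > EMPTYPOLL_MAX;
	}
	/* traffic seen: start counting from scratch */
	q->empty_polls = 0;
	return 0;
}
```

Because the counter is reset on any non-empty burst, the power-saving action
is only taken after an uninterrupted run of empty polls.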

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
      - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check

 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/pmd_mgmt.h            |  38 ++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 244 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++
 lib/librte_power/rte_power_version.map |   4 +
 5 files changed, 377 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/pmd_mgmt.h
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
new file mode 100644
index 0000000000..20be53bacf
--- /dev/null
+++ b/lib/librte_power/pmd_mgmt.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _PMD_MGMT_H
+#define _PMD_MGMT_H
+
+/**
+ * @file
+ * Power Management
+ */
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/** Power mgmt callback mode */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/** Number of consecutive empty polls */
+	uint16_t empty_poll_stats;
+	/** Callback instance */
+	const struct rte_eth_rxtx_callback *cur_cb;
+} __rte_cache_aligned;
+
+struct pmd_port_cfg {
+	int  ref_cnt;
+	struct pmd_queue_cfg *queue_cfg;
+} __rte_cache_aligned;
+
+#endif
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..07dfe7c077
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,244 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+#include "pmd_mgmt.h"
+
+
+#define EMPTYPOLL_MAX  512
+#define PAUSE_NUM  64
+
+static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			int ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = rte_eth_get_wake_addr(port_id, qidx,
+					&target_addr, &expected, &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr, expected,
+						mask, -1ULL);
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+	int i;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			for (i = 0; i < PAUSE_NUM; i++)
+				rte_pause();
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret = 0;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (port_cfg[port_id].queue_cfg == NULL) {
+		port_cfg[port_id].ref_cnt = 0;
+		/* allocate memory for empty poll stats */
+		port_cfg[port_id].queue_cfg = rte_zmalloc_socket(NULL,
+					sizeof(struct pmd_queue_cfg)
+					* RTE_MAX_QUEUES_PER_PORT,
+					0, dev->data->numa_node);
+		if (port_cfg[port_id].queue_cfg == NULL)
+			return -ENOMEM;
+	}
+
+	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto failure_handler;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id, &dummy_addr,
+				&dummy_expected, &dummy_mask) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+			!rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto failure_handler;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+	queue_cfg->cb_mode = mode;
+	port_cfg[port_id].ref_cnt++;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	return ret;
+
+failure_handler:
+	if (port_cfg[port_id].ref_cnt == 0) {
+		rte_free(port_cfg[port_id].queue_cfg);
+		port_cfg[port_id].queue_cfg = NULL;
+	}
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	if (port_cfg[port_id].ref_cnt <= 0)
+		return -EINVAL;
+
+	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
+		return -EINVAL;
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+					   queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+					   queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/* it's not recommended to free the callback instance here.
+	 * not freeing it causes a memory leak, which is a known issue.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	port_cfg[port_id].ref_cnt--;
+
+	if (port_cfg[port_id].ref_cnt == 0) {
+		rte_free(port_cfg[port_id].queue_cfg);
+		port_cfg[port_id].queue_cfg = NULL;
+	}
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..8b110f1148
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ * @param lcore_id
+ *   The ID of the lcore that polls this queue.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ * @param lcore_id
+ *   The ID of the lcore that polls this queue.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1


* [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (15 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 05/10] power: add PMD power management API and callback Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-12  7:46         ` Wang, Haiyue
  2020-10-12  8:09         ` Wang, Haiyue
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 07/10] net/i40e: " Anatoly Burakov
                         ` (3 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.
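
How the returned address, expected value and mask fit together can be
sketched as follows. This is a simplified illustration, not the driver or
intrinsic code: `STAT_DD` is an assumed stand-in for IXGBE_RXDADV_STAT_DD,
`descriptor_already_written()` is a hypothetical helper, and the
little-endian conversion done by the driver is left out.

```c
#include <stdint.h>

/* Assumed stand-in for the descriptor-done bit (IXGBE_RXDADV_STAT_DD). */
#define STAT_DD 0x01u

/*
 * Returns 1 if the monitored status word already matches the expected
 * value under the mask, i.e. the NIC has written this descriptor and
 * entering the power-optimized state should be skipped.
 */
static int
descriptor_already_written(const volatile uint32_t *status,
		uint64_t expected, uint64_t mask)
{
	return ((uint64_t)*status & mask) == expected;
}
```

Using the same constant for both the mask and the expected value means that
only the DD bit is compared; other status bits in the word are ignored.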

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 0b98e210e7..30b3f416d4 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..7a9fd2aec6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..75020fa2fc 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1
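
Note on the (address, expected, mask) triple returned by get_wake_addr:
the x86 rte_power_monitor() in patch 02/10 of this series reads a 64-bit
word at the monitored address and skips the sleep when
(value & mask) == expected. A minimal sketch of that check follows. The
demo_* names and DEMO_STAT_DD are hypothetical stand-ins, not driver
symbols, and the sketch assumes a little-endian host so that no byte
swapping is needed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "descriptor done" bit; stands in for a driver's DD flag. */
#define DEMO_STAT_DD UINT32_C(0x01)

/*
 * Return true when the masked word at addr already equals expected,
 * meaning the descriptor was written and sleeping would be pointless.
 * A zero mask disables the check, mirroring the documented behaviour
 * of the monitor intrinsic.
 */
static bool
demo_wake_condition_met(const volatile uint64_t *addr,
		uint64_t expected, uint64_t mask)
{
	if (mask == 0)
		return false;
	return (*addr & mask) == expected;
}
```

Because a full 64-bit word is read, the mask is what restricts the
comparison to the DD bit when the status field itself is only 16 or 32
bits wide.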

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 07/10] net/i40e: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (16 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-14  3:19         ` Guo, Jia
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 08/10] net/ice: " Anatoly Burakov
                         ` (2 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 943cfe71dc..cab86f8ec9 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr                = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 322fc1ed75..c17f27292f 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..f23a2073e3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 08/10] net/ice: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (17 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 07/10] net/i40e: " Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 10/10] doc: update programmer's guide for power library Anatoly Burakov
  20 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d8ce09d28f..260de5dfd7 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr                = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 93a0ac6918..9e55eca942 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -25,6 +25,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..c729e474c9 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		      uint64_t *expected, uint64_t *mask);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 09/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (18 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 08/10] net/ice: " Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 10/10] doc: update programmer's guide for power library Anatoly Burakov
  20 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, konstantin.ananyev, jerinjacobk,
	bruce.richardson, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Moved doc update here
    - Some minor formatting fixes

 .../sample_app_ug/l3_forward_power_man.rst    | 13 ++++++
 examples/l3fwd-power/main.c                   | 41 ++++++++++++++++++-
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 0cc6f2e62e..8722fbaeaa 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -459,3 +461,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` performs simple L3 forwarding and enables the power saving scheme on specific
+port/queue/lcore combinations. Its main purpose is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power --pmd-mgmt -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index d0e6c9bd77..af64dd521f 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf("PMD power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,12 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				rte_power_pmd_mgmt_queue_enable(lcore_id,
+							portid, queueid,
+						RTE_POWER_MGMT_TYPE_SCALE);
+
+			}
 		}
 	}
 
@@ -2790,6 +2812,9 @@ main(int argc, char **argv)
 						SKIP_MASTER);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MASTER);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MASTER);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2812,6 +2837,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 10/10] doc: update programmer's guide for power library
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (19 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  20 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, konstantin.ananyev, jerinjacobk,
	bruce.richardson, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Update programmer's guide to document PMD power management usage.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Moved l3fwd-power update to the l3fwd-power-related commit
    - Some rewordings and clarifications

 doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..38c64d31e4 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,45 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks and triggering
+power saving whenever the empty poll count reaches a certain threshold.
+
+  * UMWAIT/UMONITOR
+
+   This power saving scheme will put the CPU into an optimized power state and
+   use the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX descriptor
+   address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will use the `rte_pause` function to avoid busy
+   polling.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing power library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable a specific power scheme for a given queue/port/core.
+
+* **Queue Disable**: Disable the power scheme for a given queue/port/core.
+
 References
 ----------
 
@@ -200,3 +239,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
-- 
2.17.1
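
The empty-poll mechanism described in the guide text above can be
sketched as a small state machine: count consecutive empty polls, signal
that the power saving scheme should kick in once a threshold is reached,
and reset the counter as soon as packets arrive. This is only an
illustration. The demo_* names and the threshold parameter are
hypothetical and are not part of the proposed API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-queue bookkeeping; hypothetical, for illustration only. */
struct demo_poll_state {
	uint32_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Intended to be called after every RX burst with the number of packets
 * received. Returns true when the caller should enter its power saving
 * scheme (monitor/wait, pause, or frequency scaling).
 */
static bool
demo_should_sleep(struct demo_poll_state *st, uint16_t nb_rx,
		uint32_t threshold)
{
	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls >= threshold;
}
```

Keeping the counter in per-queue state matches the current limitation
that each queue is polled from exactly one core, so no locking is
needed in this sketch.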

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
@ 2020-10-09 16:09         ` Jerin Jacob
  2020-10-09 16:24           ` Burakov, Anatoly
  2020-10-12 19:47         ` David Christensen
  1 sibling, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09 16:09 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dpdk-dev, Liang Ma, Jan Viktorin, Ruifeng Wang,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Thomas Monjalon, McDaniel, Timothy, Gage Eads,
	chris.macnamara

On Fri, Oct 9, 2020 at 9:32 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
>
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
>
> For more details, please refer to Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual, Volume 2.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
>     v5:
>     - Removed return values
>     - Simplified intrinsics and hardcoded C0.2 state
>     - Added other arch stubs
>
>  lib/librte_eal/arm/include/meson.build        |   1 +
>  .../arm/include/rte_power_intrinsics.h        |  62 ++++++++++
>  .../include/generic/rte_power_intrinsics.h    |  61 ++++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/ppc/include/meson.build        |   1 +
>  .../ppc/include/rte_power_intrinsics.h        |  62 ++++++++++
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 106 ++++++++++++++++++
>  8 files changed, 295 insertions(+)
>  create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
>
> diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
> index 73b750a18f..c6a9f70d73 100644
> --- a/lib/librte_eal/arm/include/meson.build
> +++ b/lib/librte_eal/arm/include/meson.build
> @@ -20,6 +20,7 @@ arch_headers = files(
>         'rte_pause_32.h',
>         'rte_pause_64.h',
>         'rte_pause.h',
> +       'rte_power_intrinsics.h',
>         'rte_prefetch_32.h',
>         'rte_prefetch_64.h',
>         'rte_prefetch.h',
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..4aad44a0b9
> --- /dev/null
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_ARM_H_
> +#define _RTE_POWER_INTRINSIC_ARM_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on ARM.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 0 on success


Remove the @return section, as this is a void function.

> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..e36c1f8976
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file defines APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, a certain TSC timestamp is reached, or other
> + * reasons cause the CPU to wake up.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   - 0 on success
> + *   - -ENOTSUP if not supported
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index cd09027958..3a12e87e19 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -60,6 +60,7 @@ generic_headers = files(
>         'generic/rte_memcpy.h',
>         'generic/rte_pause.h',
>         'generic/rte_prefetch.h',
> +       'generic/rte_power_intrinsics.h',
>         'generic/rte_rwlock.h',
>         'generic/rte_spinlock.h',
>         'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
> index ab4bd28092..0873b2aecb 100644
> --- a/lib/librte_eal/ppc/include/meson.build
> +++ b/lib/librte_eal/ppc/include/meson.build
> @@ -10,6 +10,7 @@ arch_headers = files(
>         'rte_io.h',
>         'rte_memcpy.h',
>         'rte_pause.h',
> +       'rte_power_intrinsics.h',
>         'rte_prefetch.h',
>         'rte_rwlock.h',
>         'rte_spinlock.h',
> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..70fd7b094f
> --- /dev/null
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_PPC_H_
> +#define _RTE_POWER_INTRINSIC_PPC_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on PPC64.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 0 on success
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on PPC64.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>         'rte_io.h',
>         'rte_memcpy.h',
>         'rte_prefetch.h',
> +       'rte_power_intrinsics.h',
>         'rte_pause.h',
>         'rte_rtm.h',
>         'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..8d579eaf64
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,106 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, a certain TSC timestamp is reached, or other
> + * reasons cause the CPU to wake up.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
> + * For more information about usage of these instructions, please refer to
> + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 0 on success
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp)
> +{
> +       const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +       const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +       /*
> +        * we're using raw byte codes for now as only the newest compiler
> +        * versions support this instruction natively.
> +        */
> +
> +       /* set address for UMONITOR */
> +       asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +                       :
> +                       : "D"(p));
> +
> +       if (value_mask) {
> +               const uint64_t cur_value = *(const volatile uint64_t *)p;
> +               const uint64_t masked = cur_value & value_mask;
> +               /* if the masked value is already matching, abort */
> +               if (masked == expected_value)
> +                       return;
> +       }
> +       /* execute UMWAIT */
> +       asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
> +               : /* ignore rflags */
> +               : "D"(0), /* enter C0.2 */
> +                 "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses the TPAUSE instruction and will enter C0.2 state. For more
> + * information about usage of this instruction, please refer to Intel(R) 64 and
> + * IA-32 Architectures Software Developer's Manual.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +       const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +
> +       /* execute TPAUSE */
> +       asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
> +                    : /* ignore rflags */
> +                    : "D"(0), /* enter C0.2 */
> +                      "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 15:39             ` Ananyev, Konstantin
@ 2020-10-09 16:10               ` Burakov, Anatoly
  2020-10-09 16:56                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:10 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
> 
>> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>>>
>>>> Add two new power management intrinsics, and provide an implementation
>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>> compiler support for these instructions.
>>>>
>>>> The power management instructions provide an architecture-specific
>>>> function to either wait until a specified TSC timestamp is reached, or
>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>> location is written to. The monitor function also provides an optional
>>>> comparison, to avoid sleeping when the expected write has already
>>>> happened, and no more writes are expected.
>>>
>>> I think what this API is missing is a function to wake up a sleeping core.
>>> If user can/should use some system call to achieve that, then at least
>>> it has to be clearly documented, even better some wrapper provided.
>>
>> I don't think it's possible to do that without severely overcomplicating
>> the intrinsic and its usage, because AFAIK the only way to wake up a
>> sleeping core would be to send some kind of interrupt to the core, or
>> trigger a write to the cache-line in question.
>>
> 
> Yes, I think we either need a syscall that would do an IPI for us
> (off the top of my head, membarrier() does that; there might be other syscalls too),
> or something hand-made. For the hand-made option, I wonder whether something like this
> would be safe and sufficient:
> uint64_t val = atomic_load(addr);
> CAS(addr, val, &val);
> ?
> Anyway, one way or another - I think the ability to wake up a core we put to sleep
> has to be an essential part of this feature.
> As I understand it, the Linux kernel will limit the maximum sleep time for these instructions:
> https://lwn.net/Articles/790920/
> But relying just on that seems too vague to me:
> - user can adjust that value
> - wouldn't apply to older kernels and non-linux cases
> Konstantin
> 

This implies knowing the value the core is sleeping on. That's not 
always the case - with this particular PMD power management scheme, we 
get the address from the PMD and it stays inside the callback.
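
For reference, a minimal sketch of the hand-made variant suggested above,
written with C11 atomics. The helper name wake_by_rewrite() is hypothetical
and not part of the patchset. It assumes the caller already holds the
monitored address (which is exactly the limitation described here), and it
is an unverified assumption that such a same-value store is enough to end
the wait on every platform.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store the current value of a monitored location
 * back into it, so the cache line sees a write while the data stays the
 * same. Returns true if the compare-and-swap succeeded.
 */
static bool
wake_by_rewrite(_Atomic uint64_t *addr)
{
	assert(addr != NULL);

	uint64_t val = atomic_load(addr);

	/* write 'val' back only if the location still holds 'val' */
	return atomic_compare_exchange_strong(addr, &val, val);
}
```

If another writer changes the location between the load and the CAS, the
CAS fails, but in that case the other writer has already written to the
monitored line.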

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-08 22:26         ` Ananyev, Konstantin
@ 2020-10-09 16:11           ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:11 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 08-Oct-20 11:26 PM, Ananyev, Konstantin wrote:
>>
>> Add a simple API to allow ethdev to get the wake-up address from the PMD.
>> Also include internal structure update.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/librte_ethdev/rte_ethdev.c           | 19 ++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev_version.map |  1 +
>>   4 files changed, 72 insertions(+)
>>
>> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
>> index d7668114ca..88253d95f9 100644
>> --- a/lib/librte_ethdev/rte_ethdev.c
>> +++ b/lib/librte_ethdev/rte_ethdev.c
>> @@ -4804,6 +4804,25 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>>          dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
>>   }
>>
>> +int
>> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
>> +      volatile void **wake_addr,
>> +      uint64_t *expected, uint64_t *mask)
>> +{
>> +struct rte_eth_dev *dev;
>> +uint16_t ret;
>> +
>> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
>> +
>> +dev = &rte_eth_devices[port_id];
>> +
>> +ret = (*dev->dev_ops->get_wake_addr)
>> +(dev->data->rx_queues[queue_id],
>> + wake_addr, expected, mask);
> 
> 
> This is an optional dev_ops, so I think you need to check that get_wake_addr()
> is defined for that PMD.
> Plus you need to check that queue_id is valid.
> 

Sorry, added dev_ops check but left out the queue id part :( Will fix in v6.
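
A minimal sketch of the two checks requested above, using stand-in types so
that it compiles on its own. struct mock_dev, mock_get_wake_addr() and
example_get_wake_addr() are hypothetical names for illustration only; the
real code would work on struct rte_eth_dev and its dev_ops. The result is
kept in an int so that negative errno values are not truncated.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the optional driver op; only what the checks need. */
typedef int (*get_wake_addr_t)(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask);

struct mock_dev {
	get_wake_addr_t get_wake_addr; /* optional op, may be NULL */
	uint16_t nb_rx_queues;         /* number of configured RX queues */
	void **rx_queues;
};

/* Example driver op: reports the queue pointer as the wake address. */
static int
example_get_wake_addr(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask)
{
	*addr = rxq;
	*expected = 0;
	*mask = 0;
	return 0;
}

static int
mock_get_wake_addr(const struct mock_dev *dev, uint16_t queue_id,
		volatile void **addr, uint64_t *expected, uint64_t *mask)
{
	assert(dev != NULL);

	/* the op is optional, so the PMD may not provide it */
	if (dev->get_wake_addr == NULL)
		return -ENOTSUP;
	/* reject queue ids beyond the configured RX queues */
	if (queue_id >= dev->nb_rx_queues)
		return -EINVAL;

	return dev->get_wake_addr(dev->rx_queues[queue_id],
			addr, expected, mask);
}
```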

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics
  2020-10-09 16:09         ` Jerin Jacob
@ 2020-10-09 16:24           ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:24 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Liang Ma, Jan Viktorin, Ruifeng Wang,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Thomas Monjalon, McDaniel, Timothy, Gage Eads,
	chris.macnamara

On 09-Oct-20 5:09 PM, Jerin Jacob wrote:
> On Fri, Oct 9, 2020 at 9:32 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
>>
>> For more details, please refer to Intel(R) 64 and IA-32 Architectures
>> Software Developer's Manual, Volume 2.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v5:
>>      - Removed return values
>>      - Simplified intrinsics and hardcoded C0.2 state
>>      - Added other arch stubs

<snip>

>> +
>> +/**
>> + * This function is not supported on ARM.
>> + *
>> + * @param p
>> + *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
>> + * @param expected_value
>> + *   Before attempting the monitoring, the `p` address may be read and compared
>> + *   against this value. If `value_mask` is zero, this step will be skipped.
>> + * @param value_mask
>> + *   The 64-bit mask to use to extract current value from `p`.
>> + * @param tsc_timestamp
>> + *   Maximum TSC timestamp to wait for.
>> + *
>> + * @return
>> + *   - 0 on success
> 
> 
> remove return as it is a void function

Oops, will fix in v6. Thanks!

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback Liang Ma
@ 2020-10-09 16:38         ` Ananyev, Konstantin
  2020-10-09 16:47           ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 16:38 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API supports the 1-port-to-multiple-cores use case.
> 
> This design leverages the RX callback mechanism, which allows three
> different power management methodologies to coexist.
> 
> 1. umwait/umonitor:
> 
>    The TSC timestamp is automatically calculated using current
>    link speed and RX descriptor ring size, such that the sleep
>    time is not longer than it would take for a NIC to fill its
>    entire RX descriptor ring.
> 
> 2. Pause instruction
> 
>    Instead of moving the core into a deeper C-state, this lightweight
>    method uses the Pause instruction to relieve the processor from
>    busy polling.
> 
> 3. Frequency Scaling
>    Reuse the existing rte power library to scale core frequency up/down
>    depending on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_power/meson.build           |   5 +-
>  lib/librte_power/pmd_mgmt.h            |  49 ++++++
>  lib/librte_power/rte_power_pmd_mgmt.c  | 208 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++++
>  lib/librte_power/rte_power_version.map |   4 +
>  5 files changed, 352 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_power/pmd_mgmt.h
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> 
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 78c031c943..cc3c7a8646 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>  		'power_kvm_vm.c', 'guest_channel.c',
>  		'rte_power_empty_poll.c',
>  		'power_pstate_cpufreq.c',
> +		'rte_power_pmd_mgmt.c',
>  		'power_common.c')
> -headers = files('rte_power.h','rte_power_empty_poll.h')
> -deps += ['timer']
> +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
> +deps += ['timer' ,'ethdev']
> diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
> new file mode 100644
> index 0000000000..756fbe20f7
> --- /dev/null
> +++ b/lib/librte_power/pmd_mgmt.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#ifndef _PMD_MGMT_H
> +#define _PMD_MGMT_H
> +
> +/**
> + * @file
> + * Power Management
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum pmd_mgmt_state {
> +	/** Device power management is disabled. */
> +	PMD_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	PMD_MGMT_ENABLED,
> +};
> +
> +struct pmd_queue_cfg {
> +	enum pmd_mgmt_state pwr_mgmt_state;
> +	/**< Power mgmt Callback mode */
> +	enum rte_power_pmd_mgmt_type cb_mode;
> +	/**< Empty poll number */
> +	uint16_t empty_poll_stats;
> +	/**< Callback instance  */
> +	const struct rte_eth_rxtx_callback *cur_cb;
> +} __rte_cache_aligned;
> +
> +struct pmd_port_cfg {
> +	int  ref_cnt;
> +	struct pmd_queue_cfg *queue_cfg;
> +} __rte_cache_aligned;
> +
> +
> +
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> new file mode 100644
> index 0000000000..35d2af46a4
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> @@ -0,0 +1,208 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_malloc.h>
> +#include <rte_ethdev.h>
> +#include <rte_power_intrinsics.h>
> +
> +#include "rte_power_pmd_mgmt.h"
> +#include "pmd_mgmt.h"
> +
> +
> +#define EMPTYPOLL_MAX  512
> +#define PAUSE_NUM  64
> +
> +static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
> +
> +static uint16_t
> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +
> +	struct pmd_queue_cfg *q_conf;
> +	q_conf = &port_cfg[port_id].queue_cfg[qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {


Here and in other places - wouldn't it be better to have empty_poll_max as a
configurable parameter, instead of a constant value?

> +			volatile void *target_addr;
> +			uint64_t expected, mask;
> +			uint16_t ret;
> +
> +			/*
> +			 * get address of next descriptor in the RX
> +			 * ring for this queue, as well as expected
> +			 * value and a mask.
> +			 */
> +			ret = rte_eth_get_wake_addr(port_id, qidx,
> +						    &target_addr, &expected,
> +						    &mask);
> +			if (ret == 0)
> +				/* -1ULL is maximum value for TSC */
> +				rte_power_monitor(target_addr,
> +						  expected, mask,
> +						  0, -1ULL);


Why not make timeout a user specified parameter?

> +		}
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct pmd_queue_cfg *q_conf;
> +	int i;
> +	q_conf = &port_cfg[port_id].queue_cfg[qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			for (i = 0; i < PAUSE_NUM; i++)
> +				rte_pause();

Just rte_delay_us(timeout) instead of this loop?

> +		}
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct pmd_queue_cfg *q_conf;
> +	q_conf = &port_cfg[port_id].queue_cfg[qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			/*scale down freq */
> +			rte_power_freq_min(rte_lcore_id());
> +
> +		}
> +	} else {
> +		q_conf->empty_poll_stats = 0;
> +		/* scale up freq */
> +		rte_power_freq_max(rte_lcore_id());
> +	}
> +
> +	return nb_rx;
> +}
> +

Probably worth mentioning in the comments that these enable/disable functions
are not MT safe.

> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +				uint16_t port_id,
> +				uint16_t queue_id,
> +				enum rte_power_pmd_mgmt_type mode)
> +{
> +	struct rte_eth_dev *dev;
> +	struct pmd_queue_cfg *queue_cfg;
> +	int ret = 0;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (port_cfg[port_id].queue_cfg == NULL) {
> +		port_cfg[port_id].ref_cnt = 0;
> +		/* allocate memory for empty poll stats */
> +		port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
> +					sizeof(struct pmd_queue_cfg)
> +					* RTE_MAX_QUEUES_PER_PORT,
> +					0, dev->data->numa_node);
> +		if (port_cfg[port_id].queue_cfg == NULL)
> +			return -ENOMEM;
> +	}
> +
> +	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> +		ret = -EINVAL;
> +		goto failure_handler;
> +	}
> +
> +	switch (mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:
> +		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> +			ret = -ENOTSUP;
> +			goto failure_handler;
> +		}
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +						rte_power_mgmt_umwait, NULL);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +		/* init scale freq */
> +		if (rte_power_init(lcore_id)) {
> +			ret = -EINVAL;
> +			goto failure_handler;
> +		}
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +					rte_power_mgmt_scalefreq, NULL);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +						rte_power_mgmt_pause, NULL);
> +		break;

	default:
		....

> +	}
> +	queue_cfg->cb_mode = mode;
> +	port_cfg[port_id].ref_cnt++;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +	return ret;
> +
> +failure_handler:
> +	if (port_cfg[port_id].ref_cnt == 0) {
> +		rte_free(port_cfg[port_id].queue_cfg);
> +		port_cfg[port_id].queue_cfg = NULL;
> +	}
> +	return ret;
> +}
> +
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +				uint16_t port_id,
> +				uint16_t queue_id)
> +{
> +	struct pmd_queue_cfg *queue_cfg;
> +
> +	if (port_cfg[port_id].ref_cnt <= 0)
> +		return -EINVAL;
> +
> +	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
> +		return -EINVAL;
> +
> +	switch (queue_cfg->cb_mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:

Think we need wakeup(lcore_id) here.

> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +					   queue_cfg->cur_cb);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +		rte_power_freq_max(lcore_id);
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +					   queue_cfg->cur_cb);
> +		rte_power_exit(lcore_id);
> +		break;
> +	}
> +	/* Freeing the callback instance here is not recommended.
> +	 * Leaving it allocated leaks memory, which is a known issue.
> +	 */
> +	queue_cfg->cur_cb = NULL;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +	port_cfg[port_id].ref_cnt--;
> +
> +	if (port_cfg[port_id].ref_cnt == 0) {
> +		rte_free(port_cfg[port_id].queue_cfg);

It is not safe to do so, unless device is already stopped.
Otherwise you need some sync mechanism here (hand-made as bpf lib, or rcu online/offline, or...)

> +		port_cfg[port_id].queue_cfg = NULL;
> +	}
> +	return 0;
> +}


* Re: [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-09 16:38         ` Ananyev, Konstantin
@ 2020-10-09 16:47           ` Burakov, Anatoly
  2020-10-09 16:51             ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:47 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 09-Oct-20 5:38 PM, Ananyev, Konstantin wrote:
>> Add a simple on/off switch that will enable saving power when no
>> packets are arriving. It is based on counting the number of empty
>> polls and, when the number reaches a certain threshold, entering an
>> architecture-defined optimized power state that will either wait
>> until a TSC timestamp expires, or when packets arrive.
>>
>> This API supports the 1-port-to-multiple-cores use case.
>>
>> This design leverages the RX callback mechanism, which allows three
>> different power management methodologies to coexist.
>>
>> 1. umwait/umonitor:
>>
>>     The TSC timestamp is automatically calculated using current
>>     link speed and RX descriptor ring size, such that the sleep
>>     time is not longer than it would take for a NIC to fill its
>>     entire RX descriptor ring.
>>
>> 2. Pause instruction
>>
>>     Instead of moving the core into a deeper C-state, this lightweight
>>     method uses the Pause instruction to relieve the processor from
>>     busy polling.
>>
>> 3. Frequency Scaling
>>     Reuse the existing rte power library to scale core frequency up/down
>>     depending on traffic volume.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/librte_power/meson.build           |   5 +-
>>   lib/librte_power/pmd_mgmt.h            |  49 ++++++
>>   lib/librte_power/rte_power_pmd_mgmt.c  | 208 +++++++++++++++++++++++++
>>   lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++++
>>   lib/librte_power/rte_power_version.map |   4 +
>>   5 files changed, 352 insertions(+), 2 deletions(-)
>>   create mode 100644 lib/librte_power/pmd_mgmt.h
>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
>>
>> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
>> index 78c031c943..cc3c7a8646 100644
>> --- a/lib/librte_power/meson.build
>> +++ b/lib/librte_power/meson.build
>> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>>   'power_kvm_vm.c', 'guest_channel.c',
>>   'rte_power_empty_poll.c',
>>   'power_pstate_cpufreq.c',
>> +'rte_power_pmd_mgmt.c',
>>   'power_common.c')
>> -headers = files('rte_power.h','rte_power_empty_poll.h')
>> -deps += ['timer']
>> +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
>> +deps += ['timer' ,'ethdev']
>> diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
>> new file mode 100644
>> index 0000000000..756fbe20f7
>> --- /dev/null
>> +++ b/lib/librte_power/pmd_mgmt.h
>> @@ -0,0 +1,49 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2010-2020 Intel Corporation
>> + */
>> +
>> +#ifndef _PMD_MGMT_H
>> +#define _PMD_MGMT_H
>> +
>> +/**
>> + * @file
>> + * Power Management
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * Possible power management states of an ethdev port.
>> + */
>> +enum pmd_mgmt_state {
>> +/** Device power management is disabled. */
>> +PMD_MGMT_DISABLED = 0,
>> +/** Device power management is enabled. */
>> +PMD_MGMT_ENABLED,
>> +};
>> +
>> +struct pmd_queue_cfg {
>> +enum pmd_mgmt_state pwr_mgmt_state;
>> +/**< Power mgmt Callback mode */
>> +enum rte_power_pmd_mgmt_type cb_mode;
>> +/**< Empty poll number */
>> +uint16_t empty_poll_stats;
>> +/**< Callback instance  */
>> +const struct rte_eth_rxtx_callback *cur_cb;
>> +} __rte_cache_aligned;
>> +
>> +struct pmd_port_cfg {
>> +int  ref_cnt;
>> +struct pmd_queue_cfg *queue_cfg;
>> +} __rte_cache_aligned;
>> +
>> +
>> +
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif
>> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
>> new file mode 100644
>> index 0000000000..35d2af46a4
>> --- /dev/null
>> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
>> @@ -0,0 +1,208 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2010-2020 Intel Corporation
>> + */
>> +
>> +#include <rte_lcore.h>
>> +#include <rte_cycles.h>
>> +#include <rte_malloc.h>
>> +#include <rte_ethdev.h>
>> +#include <rte_power_intrinsics.h>
>> +
>> +#include "rte_power_pmd_mgmt.h"
>> +#include "pmd_mgmt.h"
>> +
>> +
>> +#define EMPTYPOLL_MAX  512
>> +#define PAUSE_NUM  64
>> +
>> +static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
>> +
>> +static uint16_t
>> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> +{
>> +
>> +struct pmd_queue_cfg *q_conf;
>> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
>> +
>> +if (unlikely(nb_rx == 0)) {
>> +q_conf->empty_poll_stats++;
>> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> 
> 
> Here and in other places - wouldn't it be better to have empty_poll_max as a
> configurable parameter, instead of a constant value?

It would be more flexible, but I don't think it's "better", in the sense 
that providing additional options will only make using this (already 
under-utilized!) API harder than it needs to be.
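
To make the trade-off concrete, a minimal sketch of the empty-poll counter
with the threshold kept in per-queue state instead of a compile-time
constant. struct poll_state and should_sleep() are hypothetical names, not
part of the patch. The saturating increment is an addition in this sketch,
so that the 16-bit counter cannot wrap back to zero during a long idle
period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state; the patch uses a fixed EMPTYPOLL_MAX. */
struct poll_state {
	uint16_t empty_polls;    /* consecutive polls that returned 0 */
	uint16_t empty_poll_max; /* threshold, configurable per queue */
};

/* Returns true when the caller should enter the power-optimized state. */
static bool
should_sleep(struct poll_state *st, uint16_t nb_rx)
{
	assert(st != NULL);

	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		st->empty_polls = 0;
		return false;
	}
	/* saturate instead of wrapping around */
	if (st->empty_polls < UINT16_MAX)
		st->empty_polls++;

	return st->empty_polls > st->empty_poll_max;
}
```

With a default threshold filled in by the enable call, the application would
still only need to flip the switch, and the extra field would matter only to
users who want to tune it.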

> 
>> +volatile void *target_addr;
>> +uint64_t expected, mask;
>> +uint16_t ret;
>> +
>> +/*
>> + * get address of next descriptor in the RX
>> + * ring for this queue, as well as expected
>> + * value and a mask.
>> + */
>> +ret = rte_eth_get_wake_addr(port_id, qidx,
>> +    &target_addr, &expected,
>> +    &mask);
>> +if (ret == 0)
>> +/* -1ULL is maximum value for TSC */
>> +rte_power_monitor(target_addr,
>> +  expected, mask,
>> +  0, -1ULL);
> 
> 
> Why not make timeout a user specified parameter?

This is meant to be an "easy to use" API; we were trying to keep the 
amount of configuration to an absolute minimum. If the user wants to use 
custom timeouts, they can do so by using the rte_power_monitor API explicitly.
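
As an illustration of what a custom timeout involves, a minimal sketch that
turns a microsecond timeout into an absolute TSC deadline of the kind passed
as tsc_timestamp. tsc_deadline() is a hypothetical helper; the current TSC
value and the TSC frequency are passed in as plain parameters here, on the
assumption that the application reads them through whatever timer API it
already uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert "timeout_us microseconds from now" into
 * an absolute TSC value, saturating at UINT64_MAX (the "no timeout"
 * value used by the callback in this patch).
 */
static uint64_t
tsc_deadline(uint64_t now_tsc, uint64_t tsc_hz, uint64_t timeout_us)
{
	assert(tsc_hz != 0);

	/* split seconds from the remainder to keep intermediates small */
	const uint64_t cycles = (timeout_us / 1000000) * tsc_hz +
			((timeout_us % 1000000) * tsc_hz) / 1000000;

	if (cycles > UINT64_MAX - now_tsc)
		return UINT64_MAX;

	return now_tsc + cycles;
}
```

For example, with a 2 GHz TSC a 500 us timeout adds 1,000,000 cycles to the
current timestamp.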

> 
>> +}
>> +} else
>> +q_conf->empty_poll_stats = 0;
>> +
>> +return nb_rx;
>> +}
>> +
>> +static uint16_t
>> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
>> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> +{
>> +struct pmd_queue_cfg *q_conf;
>> +int i;
>> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
>> +
>> +if (unlikely(nb_rx == 0)) {
>> +q_conf->empty_poll_stats++;
>> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> +for (i = 0; i < PAUSE_NUM; i++)
>> +rte_pause();
> 
> Just rte_delay_us(timeout) instead of this loop?

Yes, seems better, thanks.

> 
>> +}
>> +} else
>> +q_conf->empty_poll_stats = 0;
>> +
>> +return nb_rx;
>> +}
>> +
>> +static uint16_t
>> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
>> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> +{
>> +struct pmd_queue_cfg *q_conf;
>> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
>> +
>> +if (unlikely(nb_rx == 0)) {
>> +q_conf->empty_poll_stats++;
>> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> +/*scale down freq */
>> +rte_power_freq_min(rte_lcore_id());
>> +
>> +}
>> +} else {
>> +q_conf->empty_poll_stats = 0;
>> +/* scale up freq */
>> +rte_power_freq_max(rte_lcore_id());
>> +}
>> +
>> +return nb_rx;
>> +}
>> +
> 
> Probably worth mentioning in the comments that these enable/disable functions
> are not MT safe.

Will do in v6.

> 
>> +int
>> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
>> +uint16_t port_id,
>> +uint16_t queue_id,
>> +enum rte_power_pmd_mgmt_type mode)
>> +{
>> +struct rte_eth_dev *dev;
>> +struct pmd_queue_cfg *queue_cfg;
>> +int ret = 0;
>> +
>> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>> +dev = &rte_eth_devices[port_id];
>> +
>> +if (port_cfg[port_id].queue_cfg == NULL) {
>> +port_cfg[port_id].ref_cnt = 0;
>> +/* allocate memory for empty poll stats */
>> +port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
>> +sizeof(struct pmd_queue_cfg)
>> +* RTE_MAX_QUEUES_PER_PORT,
>> +0, dev->data->numa_node);
>> +if (port_cfg[port_id].queue_cfg == NULL)
>> +return -ENOMEM;
>> +}
>> +
>> +queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
>> +
>> +if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
>> +ret = -EINVAL;
>> +goto failure_handler;
>> +}
>> +
>> +switch (mode) {
>> +case RTE_POWER_MGMT_TYPE_WAIT:
>> +if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>> +ret = -ENOTSUP;
>> +goto failure_handler;
>> +}
>> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +rte_power_mgmt_umwait, NULL);
>> +break;
>> +case RTE_POWER_MGMT_TYPE_SCALE:
>> +/* init scale freq */
>> +if (rte_power_init(lcore_id)) {
>> +ret = -EINVAL;
>> +goto failure_handler;
>> +}
>> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +rte_power_mgmt_scalefreq, NULL);
>> +break;
>> +case RTE_POWER_MGMT_TYPE_PAUSE:
>> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +rte_power_mgmt_pause, NULL);
>> +break;
> 
> default:
> ....

Will add in v6.

> 
>> +}
>> +queue_cfg->cb_mode = mode;
>> +port_cfg[port_id].ref_cnt++;
>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +return ret;
>> +
>> +failure_handler:
>> +if (port_cfg[port_id].ref_cnt == 0) {
>> +rte_free(port_cfg[port_id].queue_cfg);
>> +port_cfg[port_id].queue_cfg = NULL;
>> +}
>> +return ret;
>> +}
>> +
>> +int
>> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>> +uint16_t port_id,
>> +uint16_t queue_id)
>> +{
>> +struct pmd_queue_cfg *queue_cfg;
>> +
>> +if (port_cfg[port_id].ref_cnt <= 0)
>> +return -EINVAL;
>> +
>> +queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
>> +
>> +if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
>> +return -EINVAL;
>> +
>> +switch (queue_cfg->cb_mode) {
>> +case RTE_POWER_MGMT_TYPE_WAIT:
> 
> Think we need wakeup(lcore_id) here.

Not sure what you mean? Could you please elaborate?

> 
>> +case RTE_POWER_MGMT_TYPE_PAUSE:
>> +rte_eth_remove_rx_callback(port_id, queue_id,
>> +   queue_cfg->cur_cb);
>> +break;
>> +case RTE_POWER_MGMT_TYPE_SCALE:
>> +rte_power_freq_max(lcore_id);
>> +rte_eth_remove_rx_callback(port_id, queue_id,
>> +   queue_cfg->cur_cb);
>> +rte_power_exit(lcore_id);
>> +break;
>> +}
>> +/* Freeing the callback instance here is not recommended.
>> + * Leaving it allocated leaks memory, which is a known issue.
>> + */
>> +queue_cfg->cur_cb = NULL;
>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> +port_cfg[port_id].ref_cnt--;
>> +
>> +if (port_cfg[port_id].ref_cnt == 0) {
>> +rte_free(port_cfg[port_id].queue_cfg);
> 
> It is not safe to do so, unless device is already stopped.
> Otherwise you need some sync mechanism here (hand-made as bpf lib, or rcu online/offline, or...)

Not sure what you mean. We're not freeing the callback structure, we're 
freeing the local data structure holding the per-port status.

> 
>> +port_cfg[port_id].queue_cfg = NULL;
>> +}
>> +return 0;
>> +}


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-09 16:47           ` Burakov, Anatoly
@ 2020-10-09 16:51             ` Ananyev, Konstantin
  2020-10-09 16:56               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 16:51 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen

> On 09-Oct-20 5:38 PM, Ananyev, Konstantin wrote:
> >> Add a simple on/off switch that will enable saving power when no
> >> packets are arriving. It is based on counting the number of empty
> >> polls and, when the number reaches a certain threshold, entering an
> >> architecture-defined optimized power state that will either wait
> >> until a TSC timestamp expires, or when packets arrive.
> >>
> >> This API supports the 1-port-to-multiple-cores use case.
> >>
> >> This design leverages the RX callback mechanism, which allows three
> >> different power management methodologies to coexist.
> >>
> >> 1. umwait/umonitor:
> >>
> >>     The TSC timestamp is automatically calculated using current
> >>     link speed and RX descriptor ring size, such that the sleep
> >>     time is not longer than it would take for a NIC to fill its
> >>     entire RX descriptor ring.
> >>
> >> 2. Pause instruction
> >>
> >>     Instead of moving the core into a deeper C-state, this lightweight
> >>     method uses the Pause instruction to relieve the processor from
> >>     busy polling.
> >>
> >> 3. Frequency Scaling
> >>     Reuse the existing rte power library to scale core frequency up/down
> >>     depending on traffic volume.
> >>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>   lib/librte_power/meson.build           |   5 +-
> >>   lib/librte_power/pmd_mgmt.h            |  49 ++++++
> >>   lib/librte_power/rte_power_pmd_mgmt.c  | 208 +++++++++++++++++++++++++
> >>   lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++++
> >>   lib/librte_power/rte_power_version.map |   4 +
> >>   5 files changed, 352 insertions(+), 2 deletions(-)
> >>   create mode 100644 lib/librte_power/pmd_mgmt.h
> >>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
> >>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> >>
> >> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> >> index 78c031c943..cc3c7a8646 100644
> >> --- a/lib/librte_power/meson.build
> >> +++ b/lib/librte_power/meson.build
> >> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> >>   'power_kvm_vm.c', 'guest_channel.c',
> >>   'rte_power_empty_poll.c',
> >>   'power_pstate_cpufreq.c',
> >> +'rte_power_pmd_mgmt.c',
> >>   'power_common.c')
> >> -headers = files('rte_power.h','rte_power_empty_poll.h')
> >> -deps += ['timer']
> >> +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
> >> +deps += ['timer' ,'ethdev']
> >> diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
> >> new file mode 100644
> >> index 0000000000..756fbe20f7
> >> --- /dev/null
> >> +++ b/lib/librte_power/pmd_mgmt.h
> >> @@ -0,0 +1,49 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2010-2020 Intel Corporation
> >> + */
> >> +
> >> +#ifndef _PMD_MGMT_H
> >> +#define _PMD_MGMT_H
> >> +
> >> +/**
> >> + * @file
> >> + * Power Management
> >> + */
> >> +
> >> +#ifdef __cplusplus
> >> +extern "C" {
> >> +#endif
> >> +
> >> +/**
> >> + * Possible power management states of an ethdev port.
> >> + */
> >> +enum pmd_mgmt_state {
> >> +/** Device power management is disabled. */
> >> +PMD_MGMT_DISABLED = 0,
> >> +/** Device power management is enabled. */
> >> +PMD_MGMT_ENABLED,
> >> +};
> >> +
> >> +struct pmd_queue_cfg {
> >> +enum pmd_mgmt_state pwr_mgmt_state;
> >> +/**< Power mgmt Callback mode */
> >> +enum rte_power_pmd_mgmt_type cb_mode;
> >> +/**< Empty poll number */
> >> +uint16_t empty_poll_stats;
> >> +/**< Callback instance  */
> >> +const struct rte_eth_rxtx_callback *cur_cb;
> >> +} __rte_cache_aligned;
> >> +
> >> +struct pmd_port_cfg {
> >> +int  ref_cnt;
> >> +struct pmd_queue_cfg *queue_cfg;
> >> +} __rte_cache_aligned;
> >> +
> >> +
> >> +
> >> +
> >> +#ifdef __cplusplus
> >> +}
> >> +#endif
> >> +
> >> +#endif
> >> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> >> new file mode 100644
> >> index 0000000000..35d2af46a4
> >> --- /dev/null
> >> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> >> @@ -0,0 +1,208 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2010-2020 Intel Corporation
> >> + */
> >> +
> >> +#include <rte_lcore.h>
> >> +#include <rte_cycles.h>
> >> +#include <rte_malloc.h>
> >> +#include <rte_ethdev.h>
> >> +#include <rte_power_intrinsics.h>
> >> +
> >> +#include "rte_power_pmd_mgmt.h"
> >> +#include "pmd_mgmt.h"
> >> +
> >> +
> >> +#define EMPTYPOLL_MAX  512
> >> +#define PAUSE_NUM  64
> >> +
> >> +static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
> >> +
> >> +static uint16_t
> >> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
> >> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> >> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> >> +{
> >> +
> >> +struct pmd_queue_cfg *q_conf;
> >> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
> >> +
> >> +if (unlikely(nb_rx == 0)) {
> >> +q_conf->empty_poll_stats++;
> >> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> >
> >
> > Here and in other places - wouldn't it be better to have empty_poll_max as a
> > configurable parameter, instead of a constant value?
> 
> It would be more flexible, but i don't think it's "better" in the sense
> that providing additional options will only make using this (already
> under-utilized!) API harder than it needs to be.
> 
> >
> >> +volatile void *target_addr;
> >> +uint64_t expected, mask;
> >> +uint16_t ret;
> >> +
> >> +/*
> >> + * get address of next descriptor in the RX
> >> + * ring for this queue, as well as expected
> >> + * value and a mask.
> >> + */
> >> +ret = rte_eth_get_wake_addr(port_id, qidx,
> >> +    &target_addr, &expected,
> >> +    &mask);
> >> +if (ret == 0)
> >> +/* -1ULL is maximum value for TSC */
> >> +rte_power_monitor(target_addr,
> >> +  expected, mask,
> >> +  0, -1ULL);
> >
> >
> > Why not make timeout a user specified parameter?
> 
> This is meant to be an "easy to use" API, we were trying to keep the
> amount of configuration to an absolute minimum. If the user wants to use
> custom timeouts, they can do so with using rte_power_monitor API explicitly.
> 
> >
> >> +}
> >> +} else
> >> +q_conf->empty_poll_stats = 0;
> >> +
> >> +return nb_rx;
> >> +}
> >> +
> >> +static uint16_t
> >> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> >> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> >> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> >> +{
> >> +struct pmd_queue_cfg *q_conf;
> >> +int i;
> >> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
> >> +
> >> +if (unlikely(nb_rx == 0)) {
> >> +q_conf->empty_poll_stats++;
> >> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> >> +for (i = 0; i < PAUSE_NUM; i++)
> >> +rte_pause();
> >
> > Just rte_delay_us(timeout) instead of this loop?
> 
> Yes, seems better, thanks.
> 
> >
> >> +}
> >> +} else
> >> +q_conf->empty_poll_stats = 0;
> >> +
> >> +return nb_rx;
> >> +}
> >> +
> >> +static uint16_t
> >> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> >> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> >> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> >> +{
> >> +struct pmd_queue_cfg *q_conf;
> >> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
> >> +
> >> +if (unlikely(nb_rx == 0)) {
> >> +q_conf->empty_poll_stats++;
> >> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> >> +/*scale down freq */
> >> +rte_power_freq_min(rte_lcore_id());
> >> +
> >> +}
> >> +} else {
> >> +q_conf->empty_poll_stats = 0;
> >> +/* scal up freq */
> >> +rte_power_freq_max(rte_lcore_id());
> >> +}
> >> +
> >> +return nb_rx;
> >> +}
> >> +
> >
> > Probably worth mentioning in comments that these enable/disable functions
> > are not MT safe.
> 
> Will do in v6.
> 
> >
> >> +int
> >> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> >> +uint16_t port_id,
> >> +uint16_t queue_id,
> >> +enum rte_power_pmd_mgmt_type mode)
> >> +{
> >> +struct rte_eth_dev *dev;
> >> +struct pmd_queue_cfg *queue_cfg;
> >> +int ret = 0;
> >> +
> >> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> >> +dev = &rte_eth_devices[port_id];
> >> +
> >> +if (port_cfg[port_id].queue_cfg == NULL) {
> >> +port_cfg[port_id].ref_cnt = 0;
> >> +/* allocate memory for empty poll stats */
> >> +port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
> >> +sizeof(struct pmd_queue_cfg)
> >> +* RTE_MAX_QUEUES_PER_PORT,
> >> +0, dev->data->numa_node);
> >> +if (port_cfg[port_id].queue_cfg == NULL)
> >> +return -ENOMEM;
> >> +}
> >> +
> >> +queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
> >> +
> >> +if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> >> +ret = -EINVAL;
> >> +goto failure_handler;
> >> +}
> >> +
> >> +switch (mode) {
> >> +case RTE_POWER_MGMT_TYPE_WAIT:
> >> +if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> >> +ret = -ENOTSUP;
> >> +goto failure_handler;
> >> +}
> >> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +rte_power_mgmt_umwait, NULL);
> >> +break;
> >> +case RTE_POWER_MGMT_TYPE_SCALE:
> >> +/* init scale freq */
> >> +if (rte_power_init(lcore_id)) {
> >> +ret = -EINVAL;
> >> +goto failure_handler;
> >> +}
> >> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +rte_power_mgmt_scalefreq, NULL);
> >> +break;
> >> +case RTE_POWER_MGMT_TYPE_PAUSE:
> >> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +rte_power_mgmt_pause, NULL);
> >> +break;
> >
> > default:
> > ....
> 
> Will add in v6.
> 
> >
> >> +}
> >> +queue_cfg->cb_mode = mode;
> >> +port_cfg[port_id].ref_cnt++;
> >> +queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> >> +return ret;
> >> +
> >> +failure_handler:
> >> +if (port_cfg[port_id].ref_cnt == 0) {
> >> +rte_free(port_cfg[port_id].queue_cfg);
> >> +port_cfg[port_id].queue_cfg = NULL;
> >> +}
> >> +return ret;
> >> +}
> >> +
> >> +int
> >> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> >> +uint16_t port_id,
> >> +uint16_t queue_id)
> >> +{
> >> +struct pmd_queue_cfg *queue_cfg;
> >> +
> >> +if (port_cfg[port_id].ref_cnt <= 0)
> >> +return -EINVAL;
> >> +
> >> +queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
> >> +
> >> +if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
> >> +return -EINVAL;
> >> +
> >> +switch (queue_cfg->cb_mode) {
> >> +case RTE_POWER_MGMT_TYPE_WAIT:
> >
> > Think we need wakeup(lcore_id) here.
> 
> Not sure what you mean? Could you please elaborate?
> 
> >
> >> +case RTE_POWER_MGMT_TYPE_PAUSE:
> >> +rte_eth_remove_rx_callback(port_id, queue_id,
> >> +   queue_cfg->cur_cb);
> >> +break;
> >> +case RTE_POWER_MGMT_TYPE_SCALE:
> >> +rte_power_freq_max(lcore_id);
> >> +rte_eth_remove_rx_callback(port_id, queue_id,
> >> +   queue_cfg->cur_cb);
> >> +rte_power_exit(lcore_id);
> >> +break;
> >> +}
> >> +/* it's not recommend to free callback instance here.
> >> + * it cause memory leak which is a known issue.
> >> + */
> >> +queue_cfg->cur_cb = NULL;
> >> +queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> >> +port_cfg[port_id].ref_cnt--;
> >> +
> >> +if (port_cfg[port_id].ref_cnt == 0) {
> >> +rte_free(port_cfg[port_id].queue_cfg);
> >
> > It is not safe to do so, unless device is already stopped.
> > Otherwise you need some sync mechanism here (hand-made as bpf lib, or rcu online/offline, or...)
> 
> Not sure what you mean. We're not freeing the callback structure, we're
> freeing the local data structure holding the per-port status.

What is the difference?
You are still trying to free memory that might be used by your DP thread
that still executes the callback.
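
For illustration only, one possible shape of a hand-made sync mechanism is
sketched below. It is not the code from the patch: the names `queue_state`,
`cb_run` and `state_release` are hypothetical, it uses plain C11 atomics
instead of any DPDK primitive, and it assumes that exactly one data-path
thread runs the callback for a given queue. The callback makes a counter odd
while it runs, and the control path unpublishes the pointer first, then waits
for an in-flight callback to finish before freeing.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-queue state; not the structure from the patch. */
struct queue_state {
	atomic_uint_fast32_t use;    /* odd while the callback is running */
	_Atomic(uint16_t *) stats;   /* data the callback dereferences */
};

/* Data-path side: what an RX callback would do around its accesses. */
static void
cb_run(struct queue_state *q)
{
	atomic_fetch_add(&q->use, 1);            /* counter becomes odd */
	uint16_t *s = atomic_load(&q->stats);
	if (s != NULL)
		(*s)++;                          /* e.g. count an empty poll */
	atomic_fetch_add(&q->use, 1);            /* counter becomes even */
}

/* Control-path side: unpublish first, wait for quiescence, then free. */
static void
state_release(struct queue_state *q)
{
	uint16_t *old = atomic_exchange(&q->stats, (uint16_t *)NULL);
	uint_fast32_t seen = atomic_load(&q->use);

	/* An odd value means a callback is in flight and may hold 'old'. */
	if (seen & 1)
		while (atomic_load(&q->use) == seen)
			; /* spin until that callback has left */
	free(old);
}
```

A callback that starts after the exchange can only observe NULL, so the only
invocation that can still hold the old pointer is the one that was already in
flight, and that is the one the spin waits for.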

> 
> >
> >> +port_cfg[port_id].queue_cfg = NULL;
> >> +}
> >> +return 0;
> >> +}
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-09 16:51             ` Ananyev, Konstantin
@ 2020-10-09 16:56               ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:56 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 09-Oct-20 5:51 PM, Ananyev, Konstantin wrote:

>>>
>>>> +case RTE_POWER_MGMT_TYPE_PAUSE:
>>>> +rte_eth_remove_rx_callback(port_id, queue_id,
>>>> +   queue_cfg->cur_cb);
>>>> +break;
>>>> +case RTE_POWER_MGMT_TYPE_SCALE:
>>>> +rte_power_freq_max(lcore_id);
>>>> +rte_eth_remove_rx_callback(port_id, queue_id,
>>>> +   queue_cfg->cur_cb);
>>>> +rte_power_exit(lcore_id);
>>>> +break;
>>>> +}
>>>> +/* it's not recommend to free callback instance here.
>>>> + * it cause memory leak which is a known issue.
>>>> + */
>>>> +queue_cfg->cur_cb = NULL;
>>>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>>>> +port_cfg[port_id].ref_cnt--;
>>>> +
>>>> +if (port_cfg[port_id].ref_cnt == 0) {
>>>> +rte_free(port_cfg[port_id].queue_cfg);
>>>
>>> It is not safe to do so, unless device is already stopped.
>>> Otherwise you need some sync mechanism here (hand-made as bpf lib, or rcu online/offline, or...)
>>
>> Not sure what you mean. We're not freeing the callback structure, we're
>> freeing the local data structure holding the per-port status.
> 
> What is the difference?
> You are still trying to free memory that might be used by your DP thread
> that still executes the callback.

Welp, you're right :/ I'll see what i can do to fix it.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 16:10               ` Burakov, Anatoly
@ 2020-10-09 16:56                 ` Ananyev, Konstantin
  2020-10-09 16:59                   ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 16:56 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen


> 
> On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
> >
> >> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
> >>>>
> >>>> Add two new power management intrinsics, and provide an implementation
> >>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>> compiler support for these instructions.
> >>>>
> >>>> The power management instructions provide an architecture-specific
> >>>> function to either wait until a specified TSC timestamp is reached, or
> >>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>> location is written to. The monitor function also provides an optional
> >>>> comparison, to avoid sleeping when the expected write has already
> >>>> happened, and no more writes are expected.
> >>>
> >>> I think what this API is missing - a function to wakeup sleeping core.
> >>> If user can/should use some system call to achieve that, then at least
> >>> it has to be clearly documented, even better some wrapper provided.
> >>
> >> I don't think it's possible to do that without severely overcomplicating
> >> the intrinsic and its usage, because AFAIK the only way to wake up a
> >> sleeping core would be to send some kind of interrupt to the core, or
> >> trigger a write to the cache-line in question.
> >>
> >
> > Yes, I think we either need a syscall that would do an IPI for us
> > (on top of my head - membarrier() does that, might be there are some other syscalls too),
> > or something hand-made. For hand-made, I wonder would something like that
> > be safe and sufficient:
> > uint64_t val = atomic_load(addr);
> > CAS(addr, val, &val);
> > ?
> > Anyway, one way or another - I think ability to wakeup core we put to sleep
> > have to be an essential part of this feature.
> > As I understand linux kernel will limit max amount of sleep time for these instructions:
> > https://lwn.net/Articles/790920/
> > But relying just on that, seems too vague for me:
> > - user can adjust that value
> > - wouldn't apply to older kernels and non-linux cases
> > Konstantin
> >
> 
> This implies knowing the value the core is sleeping on.

You don't need the value to wait for, you just need an address.
And you can make the wakeup function accept an address as a parameter,
same as monitor() does.

> That's not
> always the case - with this particular PMD power management scheme, we
> get the address from the PMD and it stays inside the callback.

That's fine - you can store the address inside your callback metadata
and do the wakeup as part of the _disable_ function.
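
To make the hand-made variant quoted above concrete, here is a minimal sketch
of a wakeup helper that takes only an address. The name `wake_by_rewrite` is
hypothetical and nothing like it exists in the patchset; it is written with
C11 atomics. Whether such a write is enough to end a UMWAIT on a given CPU,
and whether it is safe on a descriptor that the NIC may also be writing, are
exactly the open questions, so treat this as something to verify rather than
as a known-good implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patchset: touch the monitored
 * location with a write that leaves its contents unchanged.
 * Returns true if our own compare-and-swap performed the store. A false
 * return means another writer changed the value between the load and the
 * CAS, which is itself a write to the monitored location.
 */
static bool
wake_by_rewrite(_Atomic uint64_t *addr)
{
	uint64_t val = atomic_load(addr);

	/* Succeeds only if *addr still holds val, so it stores val over val. */
	return atomic_compare_exchange_strong(addr, &val, val);
}
```

The caller needs no knowledge of the expected value or mask that were passed
to monitor(); the address alone is sufficient.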

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 16:56                 ` Ananyev, Konstantin
@ 2020-10-09 16:59                   ` Burakov, Anatoly
  2020-10-10 13:19                     ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:59 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 09-Oct-20 5:56 PM, Ananyev, Konstantin wrote:
> 
>>
>> On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
>>>
>>>> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>>>>>
>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>> compiler support for these instructions.
>>>>>>
>>>>>> The power management instructions provide an architecture-specific
>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>> location is written to. The monitor function also provides an optional
>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>> happened, and no more writes are expected.
>>>>>
>>>>> I think what this API is missing - a function to wakeup sleeping core.
>>>>> If user can/should use some system call to achieve that, then at least
>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>
>>>> I don't think it's possible to do that without severely overcomplicating
>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>> sleeping core would be to send some kind of interrupt to the core, or
>>>> trigger a write to the cache-line in question.
>>>>
>>>
>>> Yes, I think we either need a syscall that would do an IPI for us
>>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
>>> or something hand-made. For hand-made, I wonder would something like that
>>> be safe and sufficient:
>>> uint64_t val = atomic_load(addr);
>>> CAS(addr, val, &val);
>>> ?
>>> Anyway, one way or another - I think ability to wakeup core we put to sleep
>>> have to be an essential part of this feature.
>>> As I understand linux kernel will limit max amount of sleep time for these instructions:
>>> https://lwn.net/Articles/790920/
>>> But relying just on that, seems too vague for me:
>>> - user can adjust that value
>>> - wouldn't apply to older kernels and non-linux cases
>>> Konstantin
>>>
>>
>> This implies knowing the value the core is sleeping on.
> 
> You don't need the value to wait for, you just need an address.
> And you can make the wakeup function accept an address as a parameter,
> same as monitor() does.

Sorry, i meant the address. We don't know the address we're sleeping on.

> 
>> That's not
>> always the case - with this particular PMD power management scheme, we
>> get the address from the PMD and it stays inside the callback.
> 
> That's fine - you can store the address inside your callback metadata
> and do the wakeup as part of the _disable_ function.
> 

The address may be different, and by the time we access the address it 
may become stale, so i don't see how that would help unless you're 
suggesting to have some kind of synchronization mechanism there.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
@ 2020-10-09 17:06         ` Burakov, Anatoly
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                           ` (9 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 17:06 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Bruce Richardson, Konstantin Ananyev, david.hunt,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

On 09-Oct-20 5:02 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add new x86 cpuid support for WAITPKG.
> This flag indicates that the processor supports the umwait/umonitor/tpause
> instructions.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

The work on this patchset is clearly going to take a few more days, but 
this particular patch is high priority as there are other dependencies 
on this patch for DLB drivers. Would it be possible to accept this 
particular patch now...

> ---
>   lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
>   lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
>   2 files changed, 4 insertions(+)
> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..5041a830a7 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
>   	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
>   	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
>   
> +	/**< UMWAIT/TPAUSE Instructions */
> +	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */

...with typo fix (UMONITOR rather than UMINITOR)?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 16:59                   ` Burakov, Anatoly
@ 2020-10-10 13:19                     ` Ananyev, Konstantin
  2020-10-12 10:35                       ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-10 13:19 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen



> >>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>> compiler support for these instructions.
> >>>>>>
> >>>>>> The power management instructions provide an architecture-specific
> >>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>> location is written to. The monitor function also provides an optional
> >>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>> happened, and no more writes are expected.
> >>>>>
> >>>>> I think what this API is missing - a function to wakeup sleeping core.
> >>>>> If user can/should use some system call to achieve that, then at least
> >>>>> it has to be clearly documented, even better some wrapper provided.
> >>>>
> >>>> I don't think it's possible to do that without severely overcomplicating
> >>>> the intrinsic and its usage, because AFAIK the only way to wake up a
> >>>> sleeping core would be to send some kind of interrupt to the core, or
> >>>> trigger a write to the cache-line in question.
> >>>>
> >>>
> >>> Yes, I think we either need a syscall that would do an IPI for us
> >>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
> >>> or something hand-made. For hand-made, I wonder would something like that
> >>> be safe and sufficient:
> >>> uint64_t val = atomic_load(addr);
> >>> CAS(addr, val, &val);
> >>> ?
> >>> Anyway, one way or another - I think ability to wakeup core we put to sleep
> >>> have to be an essential part of this feature.
> >>> As I understand linux kernel will limit max amount of sleep time for these instructions:
> >>> https://lwn.net/Articles/790920/
> >>> But relying just on that, seems too vague for me:
> >>> - user can adjust that value
> >>> - wouldn't apply to older kernels and non-linux cases
> >>> Konstantin
> >>>
> >>
> >> This implies knowing the value the core is sleeping on.
> >
> > You don't need the value to wait for, you just need an address.
> > And you can make the wakeup function accept an address as a parameter,
> > same as monitor() does.
> 
> Sorry, i meant the address. We don't know the address we're sleeping on.
> 
> >
> >> That's not
> >> always the case - with this particular PMD power management scheme, we
> >> get the address from the PMD and it stays inside the callback.
> >
> > That's fine - you can store the address inside your callback metadata
> > and do the wakeup as part of the _disable_ function.
> >
> 
> The address may be different, and by the time we access the address it
> may become stale, so i don't see how that would help unless you're
> suggesting to have some kind of synchronization mechanism there.

Yes, we'll need something to sync here for sure.
Sorry, I should have said so straight away, to avoid further misunderstanding.
Let's say, associate a spin_lock with monitor(), by analogy with pthread_cond_wait().
Konstantin
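
A rough sketch of that pairing, under stated assumptions: `monitor_slot` and
the `slot_*` functions are hypothetical names, the lock is a minimal C11
`atomic_flag` spinlock rather than rte_spinlock, and the actual
UMONITOR/UMWAIT steps are left as comments. The sleeper publishes the address
and arms the monitor while holding the lock, and only then drops the lock and
waits. The waker touches the address only while holding the same lock, so it
never uses an address that has already been withdrawn.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping; none of these names exist in DPDK. */
struct monitor_slot {
	atomic_flag lock;        /* minimal spinlock */
	_Atomic uint64_t *addr;  /* monitored address, NULL when not armed */
};

static void
slot_lock(struct monitor_slot *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		; /* spin */
}

static void
slot_unlock(struct monitor_slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Sleeper side: publish the address and arm the monitor under the lock. */
static void
slot_arm(struct monitor_slot *s, _Atomic uint64_t *addr)
{
	slot_lock(s);
	s->addr = addr;
	/* a real implementation would arm the monitor (UMONITOR) here */
	slot_unlock(s);
	/* ... and enter the wait (UMWAIT) only after dropping the lock */
}

/* Sleeper side: withdraw the address before it can go stale. */
static void
slot_disarm(struct monitor_slot *s)
{
	slot_lock(s);
	s->addr = NULL;
	slot_unlock(s);
}

/* Waker side: returns true if an address was published and written to. */
static bool
slot_wake(struct monitor_slot *s)
{
	bool woke = false;

	slot_lock(s);
	if (s->addr != NULL) {
		uint64_t val = atomic_load(s->addr);

		/* value-preserving write to the monitored location */
		atomic_compare_exchange_strong(s->addr, &val, val);
		woke = true;
	}
	slot_unlock(s);
	return woke;
}
```

Because the monitor is armed before the lock is released, a wake that lands
between the unlock and the wait is intended to be caught by the armed monitor,
which is the same lost-wakeup guarantee that pthread_cond_wait() gets from its
mutex.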

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
@ 2020-10-11 10:07         ` Jerin Jacob
  2020-10-12  9:26           ` Burakov, Anatoly
  2020-10-12 19:52         ` David Christensen
  1 sibling, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-11 10:07 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dpdk-dev, Jan Viktorin, Ruifeng Wang, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Liang Ma, Thomas Monjalon, McDaniel, Timothy,
	Gage Eads, chris.macnamara

On Fri, Oct 9, 2020 at 9:32 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> Currently, it is not possible to check support for intrinsics that
> are platform-specific, cannot be abstracted in a generic way, or do not
> have support on all architectures. The CPUID flags can be used to some
> extent, but they are only defined for their platform, while intrinsics
> will be available to all code as they are in generic headers.
>
> This patch introduces infrastructure to check support for certain
> platform-specific intrinsics, and adds support for checking support for
> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../arm/include/rte_power_intrinsics.h        |  8 ++++++
>  lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
>  lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
>  .../include/generic/rte_power_intrinsics.h    |  8 ++++++
>  .../ppc/include/rte_power_intrinsics.h        |  8 ++++++
>  lib/librte_eal/ppc/rte_cpuflags.c             |  6 +++++
>  lib/librte_eal/rte_eal_version.map            |  1 +
>  .../x86/include/rte_power_intrinsics.h        |  8 ++++++
>  lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
>  9 files changed, 83 insertions(+)
>
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> index 4aad44a0b9..055ec5877a 100644
> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -17,6 +17,10 @@ extern "C" {
>  /**
>   * This function is not supported on ARM.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.

See below

> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
>  /**
>   * This function is not supported on ARM.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
See below

This patch looks good to me.

Since rte_power_monitor() is a public API, I think these warnings and the
API documentation only need to live in the generic header file, rather than
being repeated everywhere.
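
For reference, the caller-side check that the warning asks for could look
roughly like the sketch below. Note that the warning text names
`rte_cpu_get_features()`, while the function this patch actually adds is
`rte_cpu_get_intrinsics_support()`; the sketch assumes the latter is meant.
To keep it self-contained it uses local stand-ins (`demo_*`, all hypothetical)
that mirror the struct and the x86 implementation from this patch, with a
plain `has_waitpkg` argument standing in for the CPUID flag lookup.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring struct rte_cpu_intrinsics from this patch. */
struct demo_cpu_intrinsics {
	uint32_t power_monitor : 1;
	uint32_t power_pause : 1;
};

/*
 * Stand-in for rte_cpu_get_intrinsics_support(); has_waitpkg replaces
 * the rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG) check.
 */
static void
demo_get_intrinsics_support(struct demo_cpu_intrinsics *intr, int has_waitpkg)
{
	memset(intr, 0, sizeof(*intr));
	if (has_waitpkg) {
		intr->power_monitor = 1;
		intr->power_pause = 1;
	}
}

/* Caller-side guard: 0 if the monitor path may be used, -1 otherwise. */
static int
demo_try_monitor(int has_waitpkg)
{
	struct demo_cpu_intrinsics intr;

	demo_get_intrinsics_support(&intr, has_waitpkg);
	if (!intr.power_monitor)
		return -1; /* fall back to plain polling */
	/* real code would call rte_power_monitor() at this point */
	return 0;
}
```

The check is the same on every architecture, which is one more reason to
document it once in the generic header.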



>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for.
>   *
> diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
> index caf3dc83a5..7eef11fa02 100644
> --- a/lib/librte_eal/arm/rte_cpuflags.c
> +++ b/lib/librte_eal/arm/rte_cpuflags.c
> @@ -138,3 +138,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
> index 872f0ebe3e..28a5aecde8 100644
> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
> @@ -13,6 +13,32 @@
>  #include "rte_common.h"
>  #include <errno.h>
>
> +#include <rte_compat.h>
> +
> +/**
> + * Structure used to describe platform-specific intrinsics that may or may not
> + * be supported at runtime.
> + */
> +struct rte_cpu_intrinsics {
> +       uint32_t power_monitor : 1;
> +       /**< indicates support for rte_power_monitor function */
> +       uint32_t power_pause : 1;
> +       /**< indicates support for rte_power_pause function */
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Check CPU support for various intrinsics at runtime.
> + *
> + * @param intrinsics
> + *     Pointer to a structure to be filled.
> + */
> +__rte_experimental
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
> +
>  /**
>   * Enumeration of all CPU features supported
>   */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> index e36c1f8976..218eda7e86 100644
> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -26,6 +26,10 @@
>   * checked against the expected value, and if they match, the entering of
>   * optimized power state may be aborted.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.

See above

> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -49,6 +53,10 @@ static inline void rte_power_monitor(const volatile void *p,
>   * Enter an architecture-defined optimized power state until a certain TSC
>   * timestamp is reached.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for. Note that the wait behavior is
>   *   architecture-dependent.
> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> index 70fd7b094f..d63ad86849 100644
> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -17,6 +17,10 @@ extern "C" {
>  /**
>   * This function is not supported on PPC64.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
>  /**
>   * This function is not supported on PPC64.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for.
>   *
> diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
> index 3bb7563ce9..eee8234384 100644
> --- a/lib/librte_eal/ppc/rte_cpuflags.c
> +++ b/lib/librte_eal/ppc/rte_cpuflags.c
> @@ -108,3 +108,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index a93dea9fe6..ed944f2bd4 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -400,6 +400,7 @@ EXPERIMENTAL {
>         # added in 20.11
>         __rte_eal_trace_generic_size_t;
>         rte_service_lcore_may_be_active;
> +       rte_cpu_get_intrinsics_support;
>  };
>
>  INTERNAL {
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> index 8d579eaf64..3afc165a1f 100644
> --- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -29,6 +29,10 @@ extern "C" {
>   * For more information about usage of these instructions, please refer to
>   * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -80,6 +84,10 @@ static inline void rte_power_monitor(const volatile void *p,
>   * information about usage of this instruction, please refer to Intel(R) 64 and
>   * IA-32 Architectures Software Developer's Manual.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for.
>   *
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 0325c4b93b..a96312ff7f 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -7,6 +7,7 @@
>  #include <stdio.h>
>  #include <errno.h>
>  #include <stdint.h>
> +#include <string.h>
>
>  #include "rte_cpuid.h"
>
> @@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +
> +       if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> +               intrinsics->power_monitor = 1;
> +               intrinsics->power_pause = 1;
> +       }
> +}
> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
@ 2020-10-12  7:46         ` Wang, Haiyue
  2020-10-12  9:28           ` Burakov, Anatoly
  2020-10-12  9:44           ` Burakov, Anatoly
  2020-10-12  8:09         ` Wang, Haiyue
  1 sibling, 2 replies; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-12  7:46 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

Hi Liang,

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Saturday, October 10, 2020 00:02
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> Macnamara, Chris <chris.macnamara@intel.com>
> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>  drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
> index 0b98e210e7..30b3f416d4 100644
> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
>  	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
>  	.tm_ops_get           = ixgbe_tm_ops_get,
>  	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
> +	.get_wake_addr        = ixgbe_get_wake_addr,
>  };
> 
>  /*
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 977ecf5137..7a9fd2aec6 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1366,6 +1366,28 @@ const uint32_t
>  		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
>  };
> 
> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask)
> +{
> +	volatile union ixgbe_adv_rx_desc *rxdp;
> +	struct ixgbe_rx_queue *rxq = rx_queue;
> +	uint16_t desc;
> +
> +	desc = rxq->rx_tail;
> +	rxdp = &rxq->rx_ring[desc];
> +	/* watch for changes in status bit */
> +	*tail_desc_addr = &rxdp->wb.upper.status_error;
> +
> +	/*
> +	 * we expect the DD bit to be set to 1 if this descriptor was already
> +	 * written to.
> +	 */
> +	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> +	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> +
> +	return 0;
> +}
> +

I'm wondering whether '.get_wake_addr' could be given a more specific name,
like 'rxq_tailq_addr_get'? That way, one day this wake-up mechanism could
also be applied on the TX side as 'txq_tailq_addr_get'. :-)

Also, could "volatile void **tail_desc_addr, uint64_t *expected, uint64_t *mask"
be merged into a 'struct xxx'? That way the API can be extended easily.
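
To make the struct suggestion concrete, here is a minimal sketch of what such a bundle could look like. The names `wake_cond` and `fake_get_wake_cond` are hypothetical and are not part of the patch or of ethdev; the "done" bit value is a stand-in, not the real ixgbe constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical: bundles what get_wake_addr currently returns through
 * three separate out-parameters. New fields could be appended later
 * without changing the callback signature.
 */
struct wake_cond {
	volatile void *addr; /* address to monitor for writes */
	uint64_t expected;   /* masked value that means "do not sleep" */
	uint64_t mask;       /* bits of *addr that take part in the comparison */
};

/* Hypothetical driver-side callback, filling the struct for a fake status word. */
static int
fake_get_wake_cond(volatile uint32_t *status, struct wake_cond *cond)
{
	assert(status != NULL && cond != NULL);

	cond->addr = status;
	cond->expected = 0x1; /* stand-in for a "descriptor done" bit */
	cond->mask = 0x1;

	return 0;
}
```

Whether such a struct would belong to ethdev or to another library is a separate question that this sketch does not try to answer.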

Just my thoughts.

Anyway, LGTM

Acked-by: Haiyue Wang <haiyue.wang@intel.com>

> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
  2020-10-12  7:46         ` Wang, Haiyue
@ 2020-10-12  8:09         ` Wang, Haiyue
  2020-10-12  9:28           ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-12  8:09 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

Hi Liang,

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Saturday, October 10, 2020 00:02
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> Macnamara, Chris <chris.macnamara@intel.com>
> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>  drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>  3 files changed, 25 insertions(+)
> 


> 
> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask)
> +{
> +	volatile union ixgbe_adv_rx_desc *rxdp;
> +	struct ixgbe_rx_queue *rxq = rx_queue;
> +	uint16_t desc;
> +
> +	desc = rxq->rx_tail;
> +	rxdp = &rxq->rx_ring[desc];
> +	/* watch for changes in status bit */
> +	*tail_desc_addr = &rxdp->wb.upper.status_error;
> +
> +	/*
> +	 * we expect the DD bit to be set to 1 if this descriptor was already
> +	 * written to.
> +	 */
> +	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> +	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> +

There seems to be an issue with byte endianness.
For example, on a BIG endian CPU:
         *expected = rte_bswap32(IXGBE_RXDADV_STAT_DD)
             !=
         *expected = rte_bswap64(IXGBE_RXDADV_STAT_DD)

And the 'rte_power_monitor' API uses a uint64_t type to access the wake-up
data:

static inline void rte_power_monitor(const volatile void *p,
		const uint64_t expected_value, const uint64_t value_mask,
		const uint64_t tsc_timestamp)
{
	if (value_mask) {
		const uint64_t cur_value = *(const volatile uint64_t *)p;
		const uint64_t masked = cur_value & value_mask;
		/* if the masked value is already matching, abort */
		if (masked == expected_value)
			return;
	}


So do we need to know the width of the wake-up data, like 16/32/64 bits?
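
As a rough illustration of the width question, the pre-sleep check could load the monitored value using its real size instead of always reading 64 bits. This is only a sketch: `monitor_read`, `monitor_already_matches` and the `size` parameter are hypothetical and do not exist in the patch under review.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: load the monitored value using its actual width. */
static inline uint64_t
monitor_read(const volatile void *p, uint8_t size)
{
	switch (size) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		assert(0 && "unsupported monitor size");
		return 0;
	}
}

/* Returns 1 if the masked value already matches, i.e. we should not sleep. */
static inline int
monitor_already_matches(const volatile void *p, uint64_t expected,
		uint64_t mask, uint8_t size)
{
	if (mask == 0)
		return 0; /* no comparison requested */

	return (monitor_read(p, size) & mask) == expected;
}
```

With a 4-byte read, the neighbouring descriptor fields never enter the comparison, so a 32-bit mask lines up with the 32-bit status word regardless of host byte order.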

> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure
  2020-10-11 10:07         ` Jerin Jacob
@ 2020-10-12  9:26           ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12  9:26 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Jan Viktorin, Ruifeng Wang, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Liang Ma, Thomas Monjalon, McDaniel, Timothy,
	Gage Eads, chris.macnamara

On 11-Oct-20 11:07 AM, Jerin Jacob wrote:
> On Fri, Oct 9, 2020 at 9:32 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> Currently, it is not possible to check support for intrinsics that
>> are platform-specific, cannot be abstracted in a generic way, or do not
>> have support on all architectures. The CPUID flags can be used to some
>> extent, but they are only defined for their platform, while intrinsics
>> will be available to all code as they are in generic headers.
>>
>> This patch introduces infrastructure to check support for certain
>> platform-specific intrinsics, and adds support for checking support for
>> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   .../arm/include/rte_power_intrinsics.h        |  8 ++++++
>>   lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
>>   lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
>>   .../include/generic/rte_power_intrinsics.h    |  8 ++++++
>>   .../ppc/include/rte_power_intrinsics.h        |  8 ++++++
>>   lib/librte_eal/ppc/rte_cpuflags.c             |  6 +++++
>>   lib/librte_eal/rte_eal_version.map            |  1 +
>>   .../x86/include/rte_power_intrinsics.h        |  8 ++++++
>>   lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
>>   9 files changed, 83 insertions(+)
>>
>> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> index 4aad44a0b9..055ec5877a 100644
>> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> @@ -17,6 +17,10 @@ extern "C" {
>>   /**
>>    * This function is not supported on ARM.
>>    *
>> + * @warning It is responsibility of the user to check if this function is
>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
>> + *   so may result in an illegal CPU instruction error.
> 
> See below
> 
>> + *
>>    * @param p
>>    *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>>    * @param expected_value
>> @@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
>>   /**
>>    * This function is not supported on ARM.
>>    *
>> + * @warning It is responsibility of the user to check if this function is
>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
>> + *   so may result in an illegal CPU instruction error.
>> + *
> See below
> 
> This patch looks good to me.
> 
> Since rte_power_monitor() API is public API, I think, only in the
> generic header file, you need to have
> these warnings and API documentation rather than repeating everywhere.
> 

Great, I'll fix that in v6. Thanks!

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  8:09         ` Wang, Haiyue
@ 2020-10-12  9:28           ` Burakov, Anatoly
  2020-10-13  1:17             ` Wang, Haiyue
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12  9:28 UTC (permalink / raw)
  To: Wang, Haiyue, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara, Chris

On 12-Oct-20 9:09 AM, Wang, Haiyue wrote:
> Hi Liang,
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Saturday, October 10, 2020 00:02
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
>> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
>> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Implement support for the power management API by implementing a
>> `get_wake_addr` function that will return an address of an RX ring's
>> status bit.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> ---
>>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>>   3 files changed, 25 insertions(+)
>>
> 
> 
>>
>> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
>> +uint64_t *expected, uint64_t *mask)
>> +{
>> +volatile union ixgbe_adv_rx_desc *rxdp;
>> +struct ixgbe_rx_queue *rxq = rx_queue;
>> +uint16_t desc;
>> +
>> +desc = rxq->rx_tail;
>> +rxdp = &rxq->rx_ring[desc];
>> +/* watch for changes in status bit */
>> +*tail_desc_addr = &rxdp->wb.upper.status_error;
>> +
>> +/*
>> + * we expect the DD bit to be set to 1 if this descriptor was already
>> + * written to.
>> + */
>> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +
> 
> There seems to be an issue with byte endianness.
> For example, on a BIG endian CPU:
>           *expected = rte_bswap32(IXGBE_RXDADV_STAT_DD)
>               !=
>           *expected = rte_bswap64(IXGBE_RXDADV_STAT_DD)
> 
> And the 'rte_power_monitor' API uses a uint64_t type to access the wake-up
> data:
> 
> static inline void rte_power_monitor(const volatile void *p,
> const uint64_t expected_value, const uint64_t value_mask,
> const uint64_t tsc_timestamp)
> {
> if (value_mask) {
> const uint64_t cur_value = *(const volatile uint64_t *)p;
> const uint64_t masked = cur_value & value_mask;
> /* if the masked value is already matching, abort */
> if (masked == expected_value)
> return;
> }
> 
> 
> So do we need to know the width of the wake-up data, like 16/32/64 bits?

Endian differences strike again! You're right of course.

I suspect casting everything to CPU endianness would fix it, would it not?
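
One possible reading of "casting everything to CPU endianness" is sketched below: decode the little-endian descriptor field into CPU byte order first, then compare against a CPU-order constant. `le32_to_cpu_portable`, `desc_done` and `FAKE_STAT_DD` are hypothetical names used for illustration only. Note that this covers just the software pre-check; it says nothing about what the monitoring hardware itself observes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_STAT_DD 0x01u /* stand-in for the driver's "descriptor done" bit */

/* Decode a little-endian 32-bit field into CPU byte order, on any host. */
static inline uint32_t
le32_to_cpu_portable(const volatile void *p)
{
	const volatile uint8_t *b = p;

	return (uint32_t)b[0] |
		((uint32_t)b[1] << 8) |
		((uint32_t)b[2] << 16) |
		((uint32_t)b[3] << 24);
}

/* Returns 1 if the "done" bit is set in a little-endian status word. */
static inline int
desc_done(const volatile void *status_le)
{
	assert(status_le != NULL);

	return (le32_to_cpu_portable(status_le) & FAKE_STAT_DD) == FAKE_STAT_DD;
}
```

Because the comparison happens on a 32-bit value in CPU order, the same mask works on both little-endian and big-endian hosts.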

> 
>> --
>> 2.17.1


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  7:46         ` Wang, Haiyue
@ 2020-10-12  9:28           ` Burakov, Anatoly
  2020-10-12  9:44           ` Burakov, Anatoly
  1 sibling, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12  9:28 UTC (permalink / raw)
  To: Wang, Haiyue, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara, Chris

On 12-Oct-20 8:46 AM, Wang, Haiyue wrote:
> Hi Liang,
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Saturday, October 10, 2020 00:02
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
>> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
>> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Implement support for the power management API by implementing a
>> `get_wake_addr` function that will return an address of an RX ring's
>> status bit.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> ---
>>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
>> index 0b98e210e7..30b3f416d4 100644
>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>> @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
>>   .udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
>>   .tm_ops_get           = ixgbe_tm_ops_get,
>>   .tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
>> +.get_wake_addr        = ixgbe_get_wake_addr,
>>   };
>>
>>   /*
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index 977ecf5137..7a9fd2aec6 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1366,6 +1366,28 @@ const uint32_t
>>   RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
>>   };
>>
>> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
>> +uint64_t *expected, uint64_t *mask)
>> +{
>> +volatile union ixgbe_adv_rx_desc *rxdp;
>> +struct ixgbe_rx_queue *rxq = rx_queue;
>> +uint16_t desc;
>> +
>> +desc = rxq->rx_tail;
>> +rxdp = &rxq->rx_ring[desc];
>> +/* watch for changes in status bit */
>> +*tail_desc_addr = &rxdp->wb.upper.status_error;
>> +
>> +/*
>> + * we expect the DD bit to be set to 1 if this descriptor was already
>> + * written to.
>> + */
>> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +
>> +return 0;
>> +}
>> +
> 
> I'm wondering whether '.get_wake_addr' could be given a more specific name,
> like 'rxq_tailq_addr_get'? That way, one day this wake-up mechanism could
> also be applied on the TX side as 'txq_tailq_addr_get'. :-)
> 
> Also, could "volatile void **tail_desc_addr, uint64_t *expected, uint64_t *mask"
> be merged into a 'struct xxx'? That way the API can be extended easily.
> 
> Just my thoughts.
> 
> Anyway, LGTM
> 
> Acked-by: Haiyue Wang <haiyue.wang@intel.com>

Great point, will think of how to address it.

> 
>> --
>> 2.17.1


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  7:46         ` Wang, Haiyue
  2020-10-12  9:28           ` Burakov, Anatoly
@ 2020-10-12  9:44           ` Burakov, Anatoly
  2020-10-12 15:58             ` Wang, Haiyue
  1 sibling, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12  9:44 UTC (permalink / raw)
  To: Wang, Haiyue, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara, Chris

On 12-Oct-20 8:46 AM, Wang, Haiyue wrote:
> Hi Liang,
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Saturday, October 10, 2020 00:02
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
>> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
>> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Implement support for the power management API by implementing a
>> `get_wake_addr` function that will return an address of an RX ring's
>> status bit.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> ---
>>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
>> index 0b98e210e7..30b3f416d4 100644
>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>> @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
>>   .udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
>>   .tm_ops_get           = ixgbe_tm_ops_get,
>>   .tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
>> +.get_wake_addr        = ixgbe_get_wake_addr,
>>   };
>>
>>   /*
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index 977ecf5137..7a9fd2aec6 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1366,6 +1366,28 @@ const uint32_t
>>   RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
>>   };
>>
>> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
>> +uint64_t *expected, uint64_t *mask)
>> +{
>> +volatile union ixgbe_adv_rx_desc *rxdp;
>> +struct ixgbe_rx_queue *rxq = rx_queue;
>> +uint16_t desc;
>> +
>> +desc = rxq->rx_tail;
>> +rxdp = &rxq->rx_ring[desc];
>> +/* watch for changes in status bit */
>> +*tail_desc_addr = &rxdp->wb.upper.status_error;
>> +
>> +/*
>> + * we expect the DD bit to be set to 1 if this descriptor was already
>> + * written to.
>> + */
>> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +
>> +return 0;
>> +}
>> +
> 
> I'm wondering whether '.get_wake_addr' could be given a more specific name,
> like 'rxq_tailq_addr_get'? That way, one day this wake-up mechanism could
> also be applied on the TX side as 'txq_tailq_addr_get'. :-)

What would be the point of sleeping on a TX queue though?

> 
> Also, could "volatile void **tail_desc_addr, uint64_t *expected, uint64_t *mask"
> be merged into a 'struct xxx'? That way the API can be extended easily.

Actually, I don't think we can do that. Well, we can, but we'll have to
either define a new struct for ethdev, or define it in the power library
and make ethdev dependent on the power library. The latter is a no-go,
and the former I don't think is a good idea, because adding a new struct
to ethdev is a big deal and I'd like to avoid that if I can.

> 
> Just my thoughts.
> 
> Anyway, LGTM
> 
> Acked-by: Haiyue Wang <haiyue.wang@intel.com>
> 
>> --
>> 2.17.1


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-10 13:19                     ` Ananyev, Konstantin
@ 2020-10-12 10:35                       ` Burakov, Anatoly
  2020-10-12 10:36                         ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12 10:35 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 10-Oct-20 2:19 PM, Ananyev, Konstantin wrote:
> 
> 
>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>> compiler support for these instructions.
>>>>>>>>
>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>> happened, and no more writes are expected.
>>>>>>>
>>>>>>> I think what this API is missing - a function to wakeup sleeping core.
>>>>>>> If user can/should use some system call to achieve that, then at least
>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>
>>>>>> I don't think it's possible to do that without severely overcomplicating
>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>> sleeping core would be to send some kind of interrupt to the core, or
>>>>>> trigger a write to the cache-line in question.
>>>>>>
>>>>>
>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
>>>>> or something hand-made. For hand-made, I wonder would something like that
>>>>> be safe and sufficient:
>>>>> uint64_t val = atomic_load(addr);
>>>>> CAS(addr, val, &val);
>>>>> ?
>>>>> Anyway, one way or another - I think ability to wakeup core we put to sleep
>>>>> have to be an essential part of this feature.
>>>>> As I understand linux kernel will limit max amount of sleep time for these instructions:
>>>>> https://lwn.net/Articles/790920/
>>>>> But relying just on that, seems too vague for me:
>>>>> - user can adjust that value
>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>> Konstantin
>>>>>
>>>>
>>>> This implies knowing the value the core is sleeping on.
>>>
>>> You don't need the value to wait for, you just need an address.
>>> And you can make wakeup function to accept address as a parameter,
>>> same as monitor() does.
>>
>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>
>>>
>>>> That's not
>>>> always the case - with this particular PMD power management scheme, we
>>>> get the address from the PMD and it stays inside the callback.
>>>
>>> That's fine - you can store address inside you callback metadata
>>> and do wakeup as part of _disable_ function.
>>>
>>
>> The address may be different, and by the time we access the address it
>> may become stale, so i don't see how that would help unless you're
>> suggesting to have some kind of synchronization mechanism there.
> 
> Yes, we'll need something to sync here for sure.
> Sorry, I should say it straightway, to avoid further misunderstanding.
> Let say, associate a spin_lock with monitor(), by analogy with pthread_cond_wait().
> Konstantin
> 

The idea was to provide an intrinsic-like function - as in, a raw
instruction call, without anything extra. We even added the masks/values
etc. only because there's no race-less way to combine UMONITOR/UMWAIT
without those.

Perhaps we can provide a synchronize-able wrapper around it, to avoid
adding overhead to callers of that function that don't need the sync
mechanism?
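
A rough sketch of how such a wrapper might be shaped, using plain C11 atomics: a small lock guards the address the core is about to sleep on, and a wakeup function performs a same-value compare-and-swap on that address, as suggested earlier in the thread. All names here (`sleep_state`, `sleep_begin`, `sleep_end`, `sleep_wakeup`) are hypothetical, the actual UMONITOR/UMWAIT call is left out, and whether a same-value CAS reliably triggers the monitoring hardware is an assumption, not something this sketch demonstrates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core state: which address, if any, the core sleeps on. */
struct sleep_state {
	atomic_flag lock;       /* tiny spinlock guarding addr */
	_Atomic uint64_t *addr; /* NULL when the core is not sleeping */
};

static void
sleep_state_init(struct sleep_state *st)
{
	atomic_flag_clear(&st->lock);
	st->addr = NULL;
}

static void
state_lock(struct sleep_state *st)
{
	while (atomic_flag_test_and_set_explicit(&st->lock,
			memory_order_acquire))
		; /* spin until the lock is free */
}

static void
state_unlock(struct sleep_state *st)
{
	atomic_flag_clear_explicit(&st->lock, memory_order_release);
}

/* Called by the sleeping core right before it would arm the monitor. */
static void
sleep_begin(struct sleep_state *st, _Atomic uint64_t *addr)
{
	assert(addr != NULL);
	state_lock(st);
	st->addr = addr;
	state_unlock(st);
}

/* Called by the sleeping core after it wakes up. */
static void
sleep_end(struct sleep_state *st)
{
	state_lock(st);
	st->addr = NULL;
	state_unlock(st);
}

/* Called from another core; returns 1 if a wakeup write was attempted. */
static int
sleep_wakeup(struct sleep_state *st)
{
	int woke = 0;

	state_lock(st);
	if (st->addr != NULL) {
		/* write the current value back without changing it */
		uint64_t val = atomic_load(st->addr);
		atomic_compare_exchange_strong(st->addr, &val, val);
		woke = 1;
	}
	state_unlock(st);

	return woke;
}
```

Callers that never need to be woken externally could keep using the raw intrinsic and pay nothing for the lock.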

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-12 10:35                       ` Burakov, Anatoly
@ 2020-10-12 10:36                         ` Burakov, Anatoly
  2020-10-12 12:50                           ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12 10:36 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 12-Oct-20 11:35 AM, Burakov, Anatoly wrote:
> On 10-Oct-20 2:19 PM, Ananyev, Konstantin wrote:
>>
>>
>>>>>>>>> Add two new power management intrinsics, and provide an 
>>>>>>>>> implementation
>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>> are implemented as raw byte opcodes because there is not yet 
>>>>>>>>> widespread
>>>>>>>>> compiler support for these instructions.
>>>>>>>>>
>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>> function to either wait until a specified TSC timestamp is 
>>>>>>>>> reached, or
>>>>>>>>> optionally wait until either a TSC timestamp is reached or a 
>>>>>>>>> memory
>>>>>>>>> location is written to. The monitor function also provides an 
>>>>>>>>> optional
>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>
>>>>>>>> I think what this API is missing - a function to wakeup sleeping 
>>>>>>>> core.
>>>>>>>> If user can/should use some system call to achieve that, then at 
>>>>>>>> least
>>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>>
>>>>>>> I don't think it's possible to do that without severely 
>>>>>>> overcomplicating
>>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>>> sleeping core would be to send some kind of interrupt to the 
>>>>>>> core, or
>>>>>>> trigger a write to the cache-line in question.
>>>>>>>
>>>>>>
>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>> (on top of my head - membarrier() does that, might be there are 
>>>>>> some other syscalls too),
>>>>>> or something hand-made. For hand-made, I wonder would something 
>>>>>> like that
>>>>>> be safe and sufficient:
>>>>>> uint64_t val = atomic_load(addr);
>>>>>> CAS(addr, val, &val);
>>>>>> ?
>>>>>> Anyway, one way or another - I think ability to wakeup core we put 
>>>>>> to sleep
>>>>>> have to be an essential part of this feature.
>>>>>> As I understand linux kernel will limit max amount of sleep time 
>>>>>> for these instructions:
>>>>>> https://lwn.net/Articles/790920/
>>>>>> But relying just on that, seems too vague for me:
>>>>>> - user can adjust that value
>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>> Konstantin
>>>>>>
>>>>>
>>>>> This implies knowing the value the core is sleeping on.
>>>>
>>>> You don't need the value to wait for, you just need an address.
>>>> And you can make wakeup function to accept address as a parameter,
>>>> same as monitor() does.
>>>
>>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>>
>>>>
>>>>> That's not
>>>>> always the case - with this particular PMD power management scheme, we
>>>>> get the address from the PMD and it stays inside the callback.
>>>>
>>>> That's fine - you can store address inside you callback metadata
>>>> and do wakeup as part of _disable_ function.
>>>>
>>>
>>> The address may be different, and by the time we access the address it
>>> may become stale, so i don't see how that would help unless you're
>>> suggesting to have some kind of synchronization mechanism there.
>>
>> Yes, we'll need something to sync here for sure.
>> Sorry, I should say it straightway, to avoid further misunderstanding.
>> Let say, associate a spin_lock with monitor(), by analogy with 
>> pthread_cond_wait().
>> Konstantin
>>
> 
> The idea was to provide an intrinsic-like function - as in, raw 
> instruction call, without anything extra. We even added the masks/values 
> etc. only because there's no race-less way to combine UMONITOR/UMWAIT 
> without those.
> 
> Perhaps we can provide a synchronize-able wrapper around it, to avoid
> adding overhead to callers of that function that don't need the sync
> mechanism?
> 

Also, how would having a spinlock help to synchronize? Are you 
suggesting we do UMWAIT on a spinlock address, or something to that effect?

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-12 10:36                         ` Burakov, Anatoly
@ 2020-10-12 12:50                           ` Ananyev, Konstantin
  2020-10-12 13:13                             ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-12 12:50 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen


> >>
> >>>>>>>>> Add two new power management intrinsics, and provide an
> >>>>>>>>> implementation
> >>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>> are implemented as raw byte opcodes because there is not yet
> >>>>>>>>> widespread
> >>>>>>>>> compiler support for these instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is
> >>>>>>>>> reached, or
> >>>>>>>>> optionally wait until either a TSC timestamp is reached or a
> >>>>>>>>> memory
> >>>>>>>>> location is written to. The monitor function also provides an
> >>>>>>>>> optional
> >>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>
> >>>>>>>> I think what this API is missing - a function to wakeup sleeping
> >>>>>>>> core.
> >>>>>>>> If user can/should use some system call to achieve that, then at
> >>>>>>>> least
> >>>>>>>> it has to be clearly documented, even better some wrapper provided.
> >>>>>>>
> >>>>>>> I don't think it's possible to do that without severely
> >>>>>>> overcomplicating
> >>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
> >>>>>>> sleeping core would be to send some kind of interrupt to the
> >>>>>>> core, or
> >>>>>>> trigger a write to the cache-line in question.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, I think we either need a syscall that would do an IPI for us
> >>>>>> (on top of my head - membarrier() does that, might be there are
> >>>>>> some other syscalls too),
> >>>>>> or something hand-made. For hand-made, I wonder would something
> >>>>>> like that
> >>>>>> be safe and sufficient:
> >>>>>> uint64_t val = atomic_load(addr);
> >>>>>> CAS(addr, val, &val);
> >>>>>> ?
> >>>>>> Anyway, one way or another - I think ability to wakeup core we put
> >>>>>> to sleep
> >>>>>> have to be an essential part of this feature.
> >>>>>> As I understand linux kernel will limit max amount of sleep time
> >>>>>> for these instructions:
> >>>>>> https://lwn.net/Articles/790920/
> >>>>>> But relying just on that, seems too vague for me:
> >>>>>> - user can adjust that value
> >>>>>> - wouldn't apply to older kernels and non-linux cases
> >>>>>> Konstantin
> >>>>>>
> >>>>>
> >>>>> This implies knowing the value the core is sleeping on.
> >>>>
> >>>> You don't the value to wait for, you just need an address.
> >>>> And you can make wakeup function to accept address as a parameter,
> >>>> same as monitor() does.
> >>>
> >>> Sorry, i meant the address. We don't know the address we're sleeping on.
> >>>
> >>>>
> >>>>> That's not
> >>>>> always the case - with this particular PMD power management scheme, we
> >>>>> get the address from the PMD and it stays inside the callback.
> >>>>
> >>>> That's fine - you can store address inside you callback metadata
> >>>> and do wakeup as part of _disable_ function.
> >>>>
> >>>
> >>> The address may be different, and by the time we access the address it
> >>> may become stale, so i don't see how that would help unless you're
> >>> suggesting to have some kind of synchronization mechanism there.
> >>
> >> Yes, we'll need something to sync here for sure.
> >> Sorry, I should say it straightway, to avoid further misunderstanding.
> >> Let say, associate a spin_lock with monitor(), by analogy with
> >> pthread_cond_wait().
> >> Konstantin
> >>
> >
> > The idea was to provide an intrinsic-like function - as in, raw
> > instruction call, without anything extra. We even added the masks/values
> > etc. only because there's no race-less way to combine UMONITOR/UMWAIT
> > without those.
> >
> > Perhaps we can provide a synchronize-able wrapper around it to avoid
> > adding overhead to calls that function but doesn't need the sync mechanism?

Yes, might be two flavours, something like
rte_power_monitor() and rte_power_monitor_sync() 
or whatever would be a better name.

> >
> 
> Also, how would having a spinlock help to synchronize? Are you
> suggesting we do UMWAIT on a spinlock address, or something to that effect?
> 

I thought about something very similar to cond_wait() working model:

/*
 * Caller has to obtain lock before calling that function.
 */
static inline int rte_power_monitor_sync(const volatile void *p,
		const uint64_t expected_value, const uint64_t value_mask,
		const uint32_t state, const uint64_t tsc_timestamp, rte_spinlock_t *lck)
{
	/* do whatever preparations are needed */
	....
	umonitor(p);

	/* already matching: don't sleep, return with the lock still held */
	if (value_mask != 0 &&
			(*((const volatile uint64_t *)p) & value_mask) == expected_value) {
		return 0;
	}

	/* release lock and go to sleep */
	rte_spinlock_unlock(lck);
	rflags = umwait();

	/* grab lock back after wakeup */
	rte_spinlock_lock(lck);

	/* do rest of processing */
	....
}

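/* similar to cond_signal */
static inline void rte_power_monitor_wakeup(volatile void *p)
{
	volatile uint64_t *addr = p;
	uint64_t v;

	/*
	 * Store back the value we just read: memory contents are unchanged,
	 * the point is only to generate a write to the monitored line.
	 */
	v = __atomic_load_n(addr, __ATOMIC_RELAXED);
	__atomic_compare_exchange_n(addr, &v, v, 1, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}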


Now in librte_power:

struct pmd_queue_cfg {
       /* to protect state and wait_addr */
       rte_spinlock_t lck;
       enum pmd_mgmt_state pwr_mgmt_state;
       void *wait_addr;
       /* rest fields */
      ....
} __rte_cache_aligned;


static uint16_t
rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
{

	struct pmd_queue_cfg *q_conf;
	q_conf = &port_cfg[port_id].queue_cfg[qidx];

	if (unlikely(nb_rx == 0)) {
		q_conf->empty_poll_stats++;
		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
			volatile void *target_addr;
			uint64_t expected, mask;
			int ret;

			/* grab the lock and check the state */
			rte_spinlock_lock(&q_conf->lck);
			if (q_conf->pwr_mgmt_state == ENABLED) {
				ret = rte_eth_get_wake_addr(port_id, qidx,
						&target_addr, &expected, &mask);
				if (ret == 0) {
					q_conf->wait_addr = target_addr;
					rte_power_monitor_sync(target_addr, ..., &q_conf->lck);
				}
				/* reset the wait_addr */
				q_conf->wait_addr = NULL;
			}
			rte_spinlock_unlock(&q_conf->lck);
			....
}

int
rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
				uint16_t port_id,
				uint16_t queue_id)
{
	...
	/* grab the lock and change the state */
	rte_spinlock_lock(&q_conf->lck);
	q_conf->pwr_mgmt_state = DISABLED;

	/* wakeup if necessary */
	if (q_conf->wait_addr != NULL)
		rte_power_monitor_wakeup(q_conf->wait_addr);

	rte_spinlock_unlock(&q_conf->lck);
	...
}
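
For reference, the "write the same value back" wakeup idea can be tried
outside DPDK. Below is a minimal sketch using C11 <stdatomic.h> instead of
the GCC __atomic builtins. monitor_wakeup() is a hypothetical name, not a
DPDK API, and whether such a same-value write is enough to wake a core out
of UMWAIT is the open question raised earlier in this thread; the sketch
only shows that the monitored value is left untouched.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a DPDK API): issue an atomic read-modify-write
 * on *p that stores back the value it just read, so the contents of the
 * monitored location do not change.
 * Returns true if the exchange succeeded, false if another writer changed
 * the value between the load and the CAS (in which case that writer has
 * already written to the location).
 */
static bool monitor_wakeup(_Atomic uint64_t *p)
{
	uint64_t v = atomic_load_explicit(p, memory_order_relaxed);

	/* expected == desired == v: a successful CAS rewrites the same value */
	return atomic_compare_exchange_strong_explicit(p, &v, v,
			memory_order_relaxed, memory_order_relaxed);
}
```

A strong CAS is used here so that a single-threaded caller never sees a
spurious failure; the weak variant used above is fine when a failed
exchange is acceptable.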


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-12 12:50                           ` Ananyev, Konstantin
@ 2020-10-12 13:13                             ` Burakov, Anatoly
  2020-10-13  9:45                               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12 13:13 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
> 
>>>>
>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>> implementation
>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet
>>>>>>>>>>> widespread
>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>
>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>> function to either wait until a specified TSC timestamp is
>>>>>>>>>>> reached, or
>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a
>>>>>>>>>>> memory
>>>>>>>>>>> location is written to. The monitor function also provides an
>>>>>>>>>>> optional
>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>
>>>>>>>>>> I think what this API is missing - a function to wakeup sleeping
>>>>>>>>>> core.
>>>>>>>>>> If user can/should use some system call to achieve that, then at
>>>>>>>>>> least
>>>>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>>>>
>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>> overcomplicating
>>>>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>>>>> sleeping core would be to send some kind of interrupt to the
>>>>>>>>> core, or
>>>>>>>>> trigger a write to the cache-line in question.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>>>> (on top of my head - membarrier() does that, might be there are
>>>>>>>> some other syscalls too),
>>>>>>>> or something hand-made. For hand-made, I wonder would something
>>>>>>>> like that
>>>>>>>> be safe and sufficient:
>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>> CAS(addr, val, &val);
>>>>>>>> ?
>>>>>>>> Anyway, one way or another - I think ability to wakeup core we put
>>>>>>>> to sleep
>>>>>>>> have to be an essential part of this feature.
>>>>>>>> As I understand linux kernel will limit max amount of sleep time
>>>>>>>> for these instructions:
>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>> But relying just on that, seems too vague for me:
>>>>>>>> - user can adjust that value
>>>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>>>> Konstantin
>>>>>>>>
>>>>>>>
>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>
>>>>>> You don't the value to wait for, you just need an address.
>>>>>> And you can make wakeup function to accept address as a parameter,
>>>>>> same as monitor() does.
>>>>>
>>>>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>>>>
>>>>>>
>>>>>>> That's not
>>>>>>> always the case - with this particular PMD power management scheme, we
>>>>>>> get the address from the PMD and it stays inside the callback.
>>>>>>
>>>>>> That's fine - you can store address inside you callback metadata
>>>>>> and do wakeup as part of _disable_ function.
>>>>>>
>>>>>
>>>>> The address may be different, and by the time we access the address it
>>>>> may become stale, so i don't see how that would help unless you're
>>>>> suggesting to have some kind of synchronization mechanism there.
>>>>
>>>> Yes, we'll need something to sync here for sure.
>>>> Sorry, I should say it straightway, to avoid further misunderstanding.
>>>> Let say, associate a spin_lock with monitor(), by analogy with
>>>> pthread_cond_wait().
>>>> Konstantin
>>>>
>>>
>>> The idea was to provide an intrinsic-like function - as in, raw
>>> instruction call, without anything extra. We even added the masks/values
>>> etc. only because there's no race-less way to combine UMONITOR/UMWAIT
>>> without those.
>>>
>>> Perhaps we can provide a synchronize-able wrapper around it to avoid
>>> adding overhead to calls that function but doesn't need the sync mechanism?
> 
> Yes, might be two flavours, something like
> rte_power_monitor() and rte_power_monitor_sync()
> or whatever would be a better name.
> 
>>>
>>
>> Also, how would having a spinlock help to synchronize? Are you
>> suggesting we do UMWAIT on a spinlock address, or something to that effect?
>>
> 
> I thought about something very similar to cond_wait() working model:
> 
> /*
>   * Caller has to obtain lock before calling that function.
>   */
> static inline int rte_power_monitor_sync(const volatile void *p,
>                  const uint64_t expected_value, const uint64_t value_mask,
>                  const uint32_t state, const uint64_t tsc_timestamp, rte_spinlock_t *lck)
> {
> /* do whatever preparations are needed */
>                 ....
> umonitor(p);
> 
> if (value_mask != 0 && *((const uint64_t *)p) & value_mask == expected_value) {
> return 0;
>   }
> 
> /* release lock and go to sleep */
> rte_spinlock_unlock(lck);
> rflags = umwait();
> 
> /* grab lock back after wakeup */
> rte_spinlock_lock(lck);
> 
> /* do rest of processing */
> ....
> }
> 
> /* similar go cond_signal */
> static inline void rte_power_monitor_wakeup(volatile void *p)
> {
> uint64_t v;
> 
> v = __atomic_load_n(p, __ATOMIC_RELAXED);
> __atomic_compare_exchange_n(p, v, &v, 1, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
> }
> 
> 
> Now in librte_power:
> 
> struct pmd_queue_cfg {
>         /* to protect state and wait_addr */
>         rte_spinlock_t lck;
>         enum pmd_mgmt_state pwr_mgmt_state;
>         void *wait_addr;
>         /* rest fields */
>        ....
> } __rte_cache_aligned;
> 
> 
> static uint16_t
> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>                  struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>                  uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> {
> 
>          struct pmd_queue_cfg *q_conf;
>          q_conf = &port_cfg[port_id].queue_cfg[qidx];
> 
>          if (unlikely(nb_rx == 0)) {
>                  q_conf->empty_poll_stats++;
>                  if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>                          volatile void *target_addr;
>                          uint64_t expected, mask;
>                          uint16_t ret;
> 
>           /* grab the lock and check the state */
>                         rte_spinlock_lock(&q_conf->lck);
>           If (q-conf->state == ENABLED) {
>                          ret = rte_eth_get_wake_addr(port_id, qidx,
>                                                      &target_addr, &expected, &mask);
>            If (ret == 0) {
> q_conf->wait_addr = target_addr;
> rte_power_monitor(target_addr, ..., &q_conf->lck);
>           }
>            /* reset the wait_addr */
>            q_conf->wait_addr = NULL;
>           }
>           rte_spinlock_unlock(&q_conf->lck);
>           ....
> }
> 
> nt
> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>                                  uint16_t port_id,
>                                  uint16_t queue_id)
> {
> ...
> /* grab the lock and change the state */
>                 rte_spinlock_lock(&q_conf->lck);
> queue_cfg->state = DISABLED;
> 
> /* wakeup if necessary */
> If (queue_cfg->wakeup_addr != NULL)
> rte_power_monitor_wakeup(queue_cfg->wakeup_addr);
> 
> rte_spinlock_unlock(&q_conf->lck);
> ...
> }
> 

Yeah, seems that i understood you correctly the first time then. I'm not 
completely convinced that this overhead and complexity is worth the 
trouble, to be honest. I mean, it's not like we're going to sleep 
indefinitely, this isn't like pthread wait - the biggest sleep time i've 
seen was around half a second and i'm not sure there is a use case for 
enabling/disabling this functionality willy nilly every 5 seconds.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  9:44           ` Burakov, Anatoly
@ 2020-10-12 15:58             ` Wang, Haiyue
  0 siblings, 0 replies; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-12 15:58 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Monday, October 12, 2020 17:45
> To: Wang, Haiyue <haiyue.wang@intel.com>; dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Hunt, David
> <david.hunt@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com;
> Richardson, Bruce <bruce.richardson@intel.com>; thomas@monjalon.net; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>; Macnamara, Chris
> <chris.macnamara@intel.com>
> Subject: Re: [PATCH v5 06/10] net/ixgbe: implement power management API
> 
> On 12-Oct-20 8:46 AM, Wang, Haiyue wrote:
> > Hi Liang,
> >
> >> -----Original Message-----
> >> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> >> Sent: Saturday, October 10, 2020 00:02
> >> To: dev@dpdk.org
> >> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> >> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> >> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce
> <bruce.richardson@intel.com>;
> >> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> <gage.eads@intel.com>;
> >> Macnamara, Chris <chris.macnamara@intel.com>
> >> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
> >>
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Implement support for the power management API by implementing a
> >> `get_wake_addr` function that will return an address of an RX ring's
> >> status bit.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> ---
> >>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
> >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
> >>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
> >>   3 files changed, 25 insertions(+)
> >>
> >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
> >> index 0b98e210e7..30b3f416d4 100644
> >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> >> @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
> >>   .udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
> >>   .tm_ops_get           = ixgbe_tm_ops_get,
> >>   .tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
> >> +.get_wake_addr        = ixgbe_get_wake_addr,
> >>   };
> >>
> >>   /*
> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> index 977ecf5137..7a9fd2aec6 100644
> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> @@ -1366,6 +1366,28 @@ const uint32_t
> >>   RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
> >>   };
> >>
> >> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> >> +uint64_t *expected, uint64_t *mask)
> >> +{
> >> +volatile union ixgbe_adv_rx_desc *rxdp;
> >> +struct ixgbe_rx_queue *rxq = rx_queue;
> >> +uint16_t desc;
> >> +
> >> +desc = rxq->rx_tail;
> >> +rxdp = &rxq->rx_ring[desc];
> >> +/* watch for changes in status bit */
> >> +*tail_desc_addr = &rxdp->wb.upper.status_error;
> >> +
> >> +/*
> >> + * we expect the DD bit to be set to 1 if this descriptor was already
> >> + * written to.
> >> + */
> >> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> >> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> >> +
> >> +return 0;
> >> +}
> >> +
> >
> > I'm wondering that whether the '.get_wake_addr' can be specific to
> > like 'rxq_tailq_addr_get' ? So that one day this wake up mechanism
> > can be applied to 'txq_tailq_addr_get' ? :-)
> 
> What would be the point of sleeping on TX queue though?

I checked; it seems that the PMD uses an internal index, not an address, so
please ignore this bad idea. ;-)

> 
> >
> > Also, "volatile void **tail_desc_addr, uint64_t *expected, uint64_t *mask"
> > can be merged into 'struct xxx' ? So that you can expand the API easily.
> 
> Actually, i don't think we can do that. Well, we can, but we'll have to
> either define a new struct for ethdev, or define it in the power library
> and make ethdev dependent on the power library. The latter is a no-go,
> and the former i don't think is a good idea because adding a new struct
> to ethdev is big deal and i'd like to avoid that if i can.

Understood the design now, thanks!

> 
> >
> > Just my thoughts.
> >
> > Anyway, LGTM
> >
> > Acked-by: Haiyue Wang <haiyue.wang@intel.com>
> >
> >> --
> >> 2.17.1
> 
> 
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
  2020-10-09 16:09         ` Jerin Jacob
@ 2020-10-12 19:47         ` David Christensen
  1 sibling, 0 replies; 421+ messages in thread
From: David Christensen @ 2020-10-12 19:47 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, Jan Viktorin, Ruifeng Wang, Bruce Richardson,
	Konstantin Ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, chris.macnamara



On 10/9/20 9:02 AM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> For more details, please refer to Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual, Volume 2.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>      v5:
>      - Removed return values
>      - Simplified intrinsics and hardcoded C0.2 state
>      - Added other arch stubs
> 

... snip ...

> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_PPC_H_
> +#define _RTE_POWER_INTRINSIC_PPC_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on PPC64.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 0 on success
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp)
> +{
> +	RTE_SET_USED(p);
> +	RTE_SET_USED(expected_value);
> +	RTE_SET_USED(value_mask);
> +	RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on PPC64.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +	RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */

I didn't find an equivalent instruction in the current 3.1 ISA, so not 
supported is correct for POWER.

Acked-by: David Christensen <drc@linux.vnet.ibm.com>


* Re: [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
  2020-10-11 10:07         ` Jerin Jacob
@ 2020-10-12 19:52         ` David Christensen
  1 sibling, 0 replies; 421+ messages in thread
From: David Christensen @ 2020-10-12 19:52 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Jan Viktorin, Ruifeng Wang, Ray Kinsella, Neil Horman,
	Bruce Richardson, Konstantin Ananyev, david.hunt, liang.j.ma,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara



On 10/9/20 9:02 AM, Anatoly Burakov wrote:
> Currently, it is not possible to check support for intrinsics that
> are platform-specific, cannot be abstracted in a generic way, or do not
> have support on all architectures. The CPUID flags can be used to some
> extent, but they are only defined for their platform, while intrinsics
> will be available to all code as they are in generic headers.
> 
> This patch introduces infrastructure to check support for certain
> platform-specific intrinsics, and adds support for checking support for
> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

... snip ...

> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> index 70fd7b094f..d63ad86849 100644
> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -17,6 +17,10 @@ extern "C" {
>   /**
>    * This function is not supported on PPC64.
>    *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>    * @param p
>    *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>    * @param expected_value
> @@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
>   /**
>    * This function is not supported on PPC64.
>    *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>    * @param tsc_timestamp
>    *   Maximum TSC timestamp to wait for.
>    *
> diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
> index 3bb7563ce9..eee8234384 100644
> --- a/lib/librte_eal/ppc/rte_cpuflags.c
> +++ b/lib/librte_eal/ppc/rte_cpuflags.c
> @@ -108,3 +108,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>   		return NULL;
>   	return rte_cpu_feature_table[feature].name;
>   }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +	memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index a93dea9fe6..ed944f2bd4 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -400,6 +400,7 @@ EXPERIMENTAL {
>   	# added in 20.11
>   	__rte_eal_trace_generic_size_t;
>   	rte_service_lcore_may_be_active;
> +	rte_cpu_get_intrinsics_support;
>   };
> 
>   INTERNAL {

Acked-by: David Christensen <drc@linux.vnet.ibm.com>
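
As a usage note for the support check added by this patch: the intent is
that callers query support once and fall back when an intrinsic is absent.
The sketch below is self-contained and uses stand-in types. struct
cpu_intrinsics, its power_monitor/power_pause fields, get_intrinsics_support()
and choose_wait_mode() are assumptions modelled on the patch, not the actual
DPDK definitions. The stub mirrors the PPC implementation quoted above, which
reports nothing as supported.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for struct rte_cpu_intrinsics; field names are assumed. */
struct cpu_intrinsics {
	uint32_t power_monitor : 1; /* monitor/wait-style intrinsic usable */
	uint32_t power_pause : 1;   /* timed pause intrinsic usable */
};

/* Stand-in for the PPC stub: no intrinsics are reported as supported. */
static void get_intrinsics_support(struct cpu_intrinsics *intrinsics)
{
	memset(intrinsics, 0, sizeof(*intrinsics));
}

enum wait_mode { WAIT_BUSY_POLL, WAIT_PAUSE, WAIT_MONITOR };

/* Pick the best wait strategy the CPU reports as supported. */
static enum wait_mode choose_wait_mode(const struct cpu_intrinsics *i)
{
	if (i->power_monitor)
		return WAIT_MONITOR;
	if (i->power_pause)
		return WAIT_PAUSE;
	return WAIT_BUSY_POLL;
}
```

With the zeroing stub, choose_wait_mode() always selects busy polling, so
code written against the generic header never reaches an unsupported
instruction on such a platform.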


* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  9:28           ` Burakov, Anatoly
@ 2020-10-13  1:17             ` Wang, Haiyue
  0 siblings, 0 replies; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-13  1:17 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Monday, October 12, 2020 17:29
> To: Wang, Haiyue <haiyue.wang@intel.com>; dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Hunt, David
> <david.hunt@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com;
> Richardson, Bruce <bruce.richardson@intel.com>; thomas@monjalon.net; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>; Macnamara, Chris
> <chris.macnamara@intel.com>
> Subject: Re: [PATCH v5 06/10] net/ixgbe: implement power management API
> 
> On 12-Oct-20 9:09 AM, Wang, Haiyue wrote:
> > Hi Liang,
> >
> >> -----Original Message-----
> >> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> >> Sent: Saturday, October 10, 2020 00:02
> >> To: dev@dpdk.org
> >> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> >> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> >> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce
> <bruce.richardson@intel.com>;
> >> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> <gage.eads@intel.com>;
> >> Macnamara, Chris <chris.macnamara@intel.com>
> >> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
> >>
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Implement support for the power management API by implementing a
> >> `get_wake_addr` function that will return an address of an RX ring's
> >> status bit.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> ---
> >>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
> >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
> >>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
> >>   3 files changed, 25 insertions(+)
> >>
> >
> >
> >>
> >> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> >> +uint64_t *expected, uint64_t *mask)
> >> +{
> >> +volatile union ixgbe_adv_rx_desc *rxdp;
> >> +struct ixgbe_rx_queue *rxq = rx_queue;
> >> +uint16_t desc;
> >> +
> >> +desc = rxq->rx_tail;
> >> +rxdp = &rxq->rx_ring[desc];
> >> +/* watch for changes in status bit */
> >> +*tail_desc_addr = &rxdp->wb.upper.status_error;
> >> +
> >> +/*
> >> + * we expect the DD bit to be set to 1 if this descriptor was already
> >> + * written to.
> >> + */
> >> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> >> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> >> +
> >
> > Seems have one issue about the byte endian:
> > Like for BIG endian:
> >           *expected = rte_bswap32(IXGBE_RXDADV_STAT_DD)
> >               !=
> >           *expected = rte_bswap64(IXGBE_RXDADV_STAT_DD)
> >
> > And in API 'rte_power_monitor', use uint64_t type to access the wake up
> > data:
> >
> > static inline void rte_power_monitor(const volatile void *p,
> > const uint64_t expected_value, const uint64_t value_mask,
> > const uint64_t tsc_timestamp)
> > {
> > if (value_mask) {
> > const uint64_t cur_value = *(const volatile uint64_t *)p;
> > const uint64_t masked = cur_value & value_mask;
> > /* if the masked value is already matching, abort */
> > if (masked == expected_value)
> > return;
> > }
> >
> >
> > So that we need the wake up address type like 16/32/64b ?
> 
> Endian differences strike again! You're right of course.
> 
> I suspect casting everything to CPU endianness would fix it, would it not?

But we also need the same data type: if a swap is needed for the cast, then
(u64 a = rte_bswap32(1)) != (u64 b = rte_bswap64(1))
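
To make the width mismatch concrete, here is a minimal standalone sketch.
It is only an illustration: swap32()/swap64() are hypothetical stand-ins for
rte_bswap32()/rte_bswap64(), written in plain C so that no DPDK headers are
assumed, and the two flag_swapped_as_*() helpers are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_bswap32(): reverse the 4 bytes of v. */
static inline uint32_t swap32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Hypothetical stand-in for rte_bswap64(): reverse the 8 bytes of v. */
static inline uint64_t swap64(uint64_t v)
{
	return ((uint64_t)swap32((uint32_t)v) << 32) |
	       swap32((uint32_t)(v >> 32));
}

/* 64-bit value seen by the comparison if the flag is swapped as 32 bits. */
static inline uint64_t flag_swapped_as_32(uint32_t flag)
{
	return (uint64_t)swap32(flag);
}

/* 64-bit value seen by the comparison if the flag is swapped as 64 bits. */
static inline uint64_t flag_swapped_as_64(uint32_t flag)
{
	return swap64((uint64_t)flag);
}

/* Self-check: for the same flag, the two widths disagree. */
static inline void check_widths_differ(void)
{
	assert(flag_swapped_as_32(1) != flag_swapped_as_64(1));
}
```

For a flag value of 1, the 32-bit swap yields 0x01000000 while the 64-bit swap
yields 0x0100000000000000, so a 64-bit masked compare can only match if the
driver and the monitor agree on the width of the field being watched.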

> 
> >
> >> --
> >> 2.17.1
> 
> 
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-12 13:13                             ` Burakov, Anatoly
@ 2020-10-13  9:45                               ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-13  9:45 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 12-Oct-20 2:13 PM, Burakov, Anatoly wrote:
> On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
>>
>>>>>
>>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>>> implementation
>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The 
>>>>>>>>>>>> instructions
>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet
>>>>>>>>>>>> widespread
>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>
>>>>>>>>>>>> The power management instructions provide an 
>>>>>>>>>>>> architecture-specific
>>>>>>>>>>>> function to either wait until a specified TSC timestamp is
>>>>>>>>>>>> reached, or
>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a
>>>>>>>>>>>> memory
>>>>>>>>>>>> location is written to. The monitor function also provides an
>>>>>>>>>>>> optional
>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has 
>>>>>>>>>>>> already
>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>
>>>>>>>>>>> I think what this API is missing - a function to wakeup sleeping
>>>>>>>>>>> core.
>>>>>>>>>>> If user can/should use some system call to achieve that, then at
>>>>>>>>>>> least
>>>>>>>>>>> it has to be clearly documented, even better some wrapper 
>>>>>>>>>>> provided.
>>>>>>>>>>
>>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>>> overcomplicating
>>>>>>>>>> the intrinsic and its usage, because AFAIK the only way to 
>>>>>>>>>> wake up a
>>>>>>>>>> sleeping core would be to send some kind of interrupt to the
>>>>>>>>>> core, or
>>>>>>>>>> trigger a write to the cache-line in question.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>>>>> (on top of my head - membarrier() does that, might be there are
>>>>>>>>> some other syscalls too),
>>>>>>>>> or something hand-made. For hand-made, I wonder would something
>>>>>>>>> like that
>>>>>>>>> be safe and sufficient:
>>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>>> CAS(addr, val, &val);
>>>>>>>>> ?
>>>>>>>>> Anyway, one way or another - I think ability to wakeup core we put
>>>>>>>>> to sleep
>>>>>>>>> have to be an essential part of this feature.
>>>>>>>>> As I understand linux kernel will limit max amount of sleep time
>>>>>>>>> for these instructions:
>>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>>> But relying just on that, seems too vague for me:
>>>>>>>>> - user can adjust that value
>>>>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>>>>> Konstantin
>>>>>>>>>
>>>>>>>>
>>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>>
>>>>>>> You don't need the value to wait for, you just need an address.
>>>>>>> And you can make wakeup function to accept address as a parameter,
>>>>>>> same as monitor() does.
>>>>>>
>>>>>> Sorry, i meant the address. We don't know the address we're 
>>>>>> sleeping on.
>>>>>>
>>>>>>>
>>>>>>>> That's not
>>>>>>>> always the case - with this particular PMD power management 
>>>>>>>> scheme, we
>>>>>>>> get the address from the PMD and it stays inside the callback.
>>>>>>>
>>>>>>> That's fine - you can store the address inside your callback metadata
>>>>>>> and do wakeup as part of _disable_ function.
>>>>>>>
>>>>>>
>>>>>> The address may be different, and by the time we access the 
>>>>>> address it
>>>>>> may become stale, so i don't see how that would help unless you're
>>>>>> suggesting to have some kind of synchronization mechanism there.
>>>>>
>>>>> Yes, we'll need something to sync here for sure.
>>>>> Sorry, I should have said so straight away, to avoid further misunderstanding.
>>>>> Let say, associate a spin_lock with monitor(), by analogy with
>>>>> pthread_cond_wait().
>>>>> Konstantin
>>>>>
>>>>
>>>> The idea was to provide an intrinsic-like function - as in, raw
>>>> instruction call, without anything extra. We even added the 
>>>> masks/values
>>>> etc. only because there's no race-less way to combine UMONITOR/UMWAIT
>>>> without those.
>>>>
>>>> Perhaps we can provide a synchronize-able wrapper around it to avoid
>>>> adding overhead to callers of that function that don't need the sync
>>>> mechanism?
>>
>> Yes, might be two flavours, something like
>> rte_power_monitor() and rte_power_monitor_sync()
>> or whatever would be a better name.
>>
>>>>
>>>
>>> Also, how would having a spinlock help to synchronize? Are you
>>> suggesting we do UMWAIT on a spinlock address, or something to that 
>>> effect?
>>>
>>
>> I thought about something very similar to cond_wait() working model:
>>
>> /*
>>   * Caller has to obtain lock before calling that function.
>>   */
>> static inline int rte_power_monitor_sync(const volatile void *p,
>>                  const uint64_t expected_value, const uint64_t 
>> value_mask,
>>                  const uint32_t state, const uint64_t tsc_timestamp, 
>> rte_spinlock_t *lck)
>> {
>> /* do whatever preparations are needed */
>>                 ....
>> umonitor(p);
>>
>> if (value_mask != 0 &&
>>     (*((const volatile uint64_t *)p) & value_mask) == expected_value) {
>> return 0;
>>   }
>>
>> /* release lock and go to sleep */
>> rte_spinlock_unlock(lck);
>> rflags = umwait();
>>
>> /* grab lock back after wakeup */
>> rte_spinlock_lock(lck);
>>
>> /* do rest of processing */
>> ....
>> }
>>
>> /* similar to cond_signal */
>> static inline void rte_power_monitor_wakeup(volatile void *p)
>> {
>> uint64_t v;
>>
>> v = __atomic_load_n((volatile uint64_t *)p, __ATOMIC_RELAXED);
>> __atomic_compare_exchange_n((volatile uint64_t *)p, &v, v, 1,
>> __ATOMIC_RELAXED, __ATOMIC_RELAXED);
>> }
>>
>>
>> Now in librte_power:
>>
>> struct pmd_queue_cfg {
>>         /* to protect state and wait_addr */
>>         rte_spinlock_t lck;
>>         enum pmd_mgmt_state pwr_mgmt_state;
>>         void *wait_addr;
>>         /* rest fields */
>>        ....
>> } __rte_cache_aligned;
>>
>>
>> static uint16_t
>> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>>                  struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>>                  uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> {
>>
>>          struct pmd_queue_cfg *q_conf;
>>          q_conf = &port_cfg[port_id].queue_cfg[qidx];
>>
>>          if (unlikely(nb_rx == 0)) {
>>                  q_conf->empty_poll_stats++;
>>                  if (unlikely(q_conf->empty_poll_stats > 
>> EMPTYPOLL_MAX)) {
>>                          volatile void *target_addr;
>>                          uint64_t expected, mask;
>>                          uint16_t ret;
>>
>>                          /* grab the lock and check the state */
>>                          rte_spinlock_lock(&q_conf->lck);
>>                          if (q_conf->pwr_mgmt_state == ENABLED) {
>>                                  ret = rte_eth_get_wake_addr(port_id, qidx,
>>                                                  &target_addr, &expected, &mask);
>>                                  if (ret == 0) {
>>                                          q_conf->wait_addr = target_addr;
>>                                          rte_power_monitor_sync(target_addr, ..., &q_conf->lck);
>>                                  }
>>                                  /* reset the wait_addr */
>>                                  q_conf->wait_addr = NULL;
>>                          }
>>                          rte_spinlock_unlock(&q_conf->lck);
>>                          ....
>> }
>>
>> int
>> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>>                                  uint16_t port_id,
>>                                  uint16_t queue_id)
>> {
>> ...
>> /* grab the lock and change the state */
>> rte_spinlock_lock(&q_conf->lck);
>> q_conf->pwr_mgmt_state = DISABLED;
>>
>> /* wakeup if necessary */
>> if (q_conf->wait_addr != NULL)
>> rte_power_monitor_wakeup(q_conf->wait_addr);
>>
>> rte_spinlock_unlock(&q_conf->lck);
>> ...
>> }
>>
> 
> Yeah, seems that i understood you correctly the first time then. I'm not 
> completely convinced that this overhead and complexity is worth the 
> trouble, to be honest. I mean, it's not like we're going to sleep 
> indefinitely, this isn't like pthread wait - the biggest sleep time i've 
> seen was around half a second and i'm not sure there is a use case for 
> enabling/disabling this functionality willy nilly every 5 seconds.
> 

Back story: we've had a little internal chat and basically agreed to 
Konstantin's proposal, with slight modifications. That is, we need to be 
able to wake up the core because otherwise we have no deterministic way 
of stopping the sleeping RX path, however there is no need to expose 
this mechanism in a public API and it can be kept inside the power 
library instead.
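
For illustration only, the wake-up half of that scheme could look roughly like
the sketch below. It uses C11 atomics rather than DPDK or GCC builtins, and
monitor_wakeup() is a hypothetical name, not a function from the patchset. The
assumption (true on x86 as far as I know, but platform-specific in general) is
that a compare-and-swap storing the value already present still counts as a
write to the monitored cache line, so the sleeping core wakes up while the
descriptor contents stay untouched.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a DPDK API): issue a write to the monitored
 * location without changing the value stored there.
 */
static inline void monitor_wakeup(_Atomic uint64_t *addr)
{
	uint64_t v;

	assert(addr != NULL);

	/* read whatever is currently stored at the monitored address */
	v = atomic_load_explicit(addr, memory_order_relaxed);

	/*
	 * Try to store the same value back. If the exchange fails, somebody
	 * else has just written to the location, which is an equally good
	 * wake-up event, so the result is deliberately ignored.
	 */
	atomic_compare_exchange_strong_explicit(addr, &v, v,
			memory_order_relaxed, memory_order_relaxed);
}
```

In the scheme discussed above, the disable path would call something like this
on the stored wait address while holding the per-queue lock, so that the
address cannot go stale between the check and the write.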

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API Anatoly Burakov
@ 2020-10-14  3:10         ` Guo, Jia
  2020-10-14  9:07           ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Guo, Jia @ 2020-10-14  3:10 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris


> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Anatoly Burakov
> Sent: Saturday, October 10, 2020 12:02 AM
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
> <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
> Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> Macnamara, Chris <chris.macnamara@intel.com>
> Subject: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
> management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple API to allow getting address of next RX descriptor from the
> PMD, as well as release notes information.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v5:
>     - Bring function format in line with other functions in the file
>     - Ensure the API is supported by the driver before calling it (Konstantin)
> 
>  doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
>  lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
>  lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_version.map |  1 +
>  5 files changed, 86 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/release_20_11.rst
> b/doc/guides/rel_notes/release_20_11.rst
> index 808bdc4e54..e85af5d3e9 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -55,6 +55,11 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
> 
> +* **ethdev: add 1 new EXPERIMENTAL API for PMD power
> management.**
> +
> +  * ``rte_eth_get_wake_addr()``
> +  * add new eth_dev_ops ``get_wake_addr``
> +
>  * **Updated Broadcom bnxt driver.**
> 
>    Updated the Broadcom bnxt driver with new features and improvements,
> including:
> @@ -136,6 +141,17 @@ New Features
>    * Extern objects and functions can be plugged into the pipeline.
>    * Transaction-oriented table updates.
> 
> +* **Add PMD power management mechanism**
> +
> +  3 new Ethernet PMD power management mechanism is added through

" mechanisms are " please.

> + existing  RX callback infrastructure.
> +
> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
> +  * Add power saving scheme based on ``rte_pause()``
> +  * Add power saving scheme based on frequency scaling through the
> + power library
> +  * Add new EXPERIMENTAL API
> ``rte_power_pmd_mgmt_queue_enable()``
> +  * Add new EXPERIMENTAL API
> ``rte_power_pmd_mgmt_queue_disable()``
> +

Could this doc change be moved to the specific patch it belongs to, if it is not related to this patch?

> 
>  Removed Items
>  -------------
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 48d1333b17..352108f43c 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -4804,6 +4804,23 @@ rte_eth_tx_burst_mode_get(uint16_t port_id,
> uint16_t queue_id,
>  		       dev->dev_ops->tx_burst_mode_get(dev, queue_id,
> mode));  }
> 
> +int
> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		volatile void **wake_addr, uint64_t *expected, uint64_t
> *mask) {
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +
> +	dev = &rte_eth_devices[port_id];
> +
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -
> ENOTSUP);
> +
> +	return eth_err(port_id,
> +		dev->dev_ops->get_wake_addr(dev->data-
> >rx_queues[queue_id],
> +			wake_addr, expected, mask));
> +}
> +
>  int
>  rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>  			     struct rte_ether_addr *mc_addr_set, diff --git
> a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h index
> d2bf74f128..a6cfe3cd57 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -4014,6 +4014,30 @@ __rte_experimental  int
> rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  	struct rte_eth_burst_mode *mode);
> 
> +/**
> + * Retrieve the wake up address from specific queue
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Tx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param wake_addr
> + *   The pointer point to the address which is used for monitoring.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + *
> + * @return
> + *   - 0: Success.
> + *   -EINVAL: Failed to get wake address.
> + */

Is "-EINVAL" the only error value that will be returned?

> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +			  volatile void **wake_addr,
> +			  uint64_t *expected, uint64_t *mask);
> +
>  /**
>   * Retrieve device registers and register attributes (number of registers and
>   * register size)
> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h
> b/lib/librte_ethdev/rte_ethdev_driver.h
> index c3062c246c..935d46f25c 100644
> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> @@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the Wake up address.
> + *
> + * @param rxq
> + *   Ethdev queue pointer.
> + * @param tail_desc_addr
> + *   The pointer point to descriptor address var.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */

The question is the same as above.

> +typedef int (*eth_get_wake_addr_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet
> driver.
>   */
> @@ -713,6 +738,9 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_get_wake_addr_t get_wake_addr;
> +	/**< Get wake up address. */
> +
>  };
> 
>  /**
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map
> b/lib/librte_ethdev/rte_ethdev_version.map
> index c95ef5157a..3cb2093980 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -229,6 +229,7 @@ EXPERIMENTAL {
>  	# added in 20.11
>  	rte_eth_link_speed_to_str;
>  	rte_eth_link_to_str;
> +	rte_eth_get_wake_addr;
>  };
> 
>  INTERNAL {
> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v5 07/10] net/i40e: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 07/10] net/i40e: " Anatoly Burakov
@ 2020-10-14  3:19         ` Guo, Jia
  2020-10-14  9:08           ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Guo, Jia @ 2020-10-14  3:19 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Xing, Beilei, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris


> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Saturday, October 10, 2020 12:02 AM
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Xing, Beilei <beilei.xing@intel.com>;
> Guo, Jia <jia.guo@intel.com>; Hunt, David <david.hunt@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>;
> Eads, Gage <gage.eads@intel.com>; Macnamara, Chris
> <chris.macnamara@intel.com>
> Subject: [PATCH v5 07/10] net/i40e: implement power management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's status bit.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  drivers/net/i40e/i40e_ethdev.c |  1 +
>  drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
>  drivers/net/i40e/i40e_rxtx.h   |  2 ++
>  3 files changed, 26 insertions(+)
> 
> diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
> index 943cfe71dc..cab86f8ec9 100644
> --- a/drivers/net/i40e/i40e_ethdev.c
> +++ b/drivers/net/i40e/i40e_ethdev.c
> @@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
>  	.mtu_set                      = i40e_dev_mtu_set,
>  	.tm_ops_get                   = i40e_tm_ops_get,
>  	.tx_done_cleanup              = i40e_tx_done_cleanup,
> +	.get_wake_addr	              = i40e_get_wake_addr,
>  };
> 
>  /* store statistics names and its offset in stats structure */ diff --git
> a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index
> 322fc1ed75..c17f27292f 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -71,6 +71,29 @@
>  #define I40E_TX_OFFLOAD_NOTSUP_MASK \
>  		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
> 
> +int
> +i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask)
> +{
> +	struct i40e_rx_queue *rxq = rx_queue;
> +	volatile union i40e_rx_desc *rxdp;
> +	uint16_t desc;
> +
> +	desc = rxq->rx_tail;
> +	rxdp = &rxq->rx_ring[desc];
> +	/* watch for changes in status bit */
> +	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
> +
> +	/*
> +	 * we expect the DD bit to be set to 1 if this descriptor was already
> +	 * written to.
> +	 */
> +	*expected = rte_cpu_to_le_64(1 <<
> I40E_RX_DESC_STATUS_DD_SHIFT);
> +	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> +
> +	return 0;

I suppose getting the wake address will always succeed in i40e, right?

> +}
> +
>  static inline void
>  i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc
> *rxdp)  { diff --git a/drivers/net/i40e/i40e_rxtx.h
> b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..f23a2073e3 100644
> --- a/drivers/net/i40e/i40e_rxtx.h
> +++ b/drivers/net/i40e/i40e_rxtx.h
> @@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void
> *rx_queue,
>  	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);  uint16_t
> i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
>  	uint16_t nb_pkts);
> +int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *value);
> 
>  /* For each value it means, datasheet of hardware can tell more details
>   *
> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-14  3:10         ` Guo, Jia
@ 2020-10-14  9:07           ` Burakov, Anatoly
  2020-10-14  9:15             ` Guo, Jia
  2020-10-14  9:23             ` Bruce Richardson
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-14  9:07 UTC (permalink / raw)
  To: Guo, Jia, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris

On 14-Oct-20 4:10 AM, Guo, Jia wrote:
> 
>> -----Original Message-----
>> From: dev <dev-bounces@dpdk.org> On Behalf Of Anatoly Burakov
>> Sent: Saturday, October 10, 2020 12:02 AM
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
>> <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
>> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
>> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
>> <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
>> Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
>> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
>> management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add a simple API to allow getting address of next RX descriptor from the
>> PMD, as well as release notes information.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---

Hi Jia,

Thanks for your review. Responses below.

>>
>> Notes:
>>      v5:
>>      - Bring function format in line with other functions in the file
>>      - Ensure the API is supported by the driver before calling it (Konstantin)
>>
>>   doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
>>   lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
>>   lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev_version.map |  1 +
>>   5 files changed, 86 insertions(+)
>>
>> diff --git a/doc/guides/rel_notes/release_20_11.rst
>> b/doc/guides/rel_notes/release_20_11.rst
>> index 808bdc4e54..e85af5d3e9 100644
>> --- a/doc/guides/rel_notes/release_20_11.rst
>> +++ b/doc/guides/rel_notes/release_20_11.rst
>> @@ -55,6 +55,11 @@ New Features
>>        Also, make sure to start the actual text at the margin.
>>        =======================================================
>>
>> +* **ethdev: add 1 new EXPERIMENTAL API for PMD power
>> management.**
>> +
>> +  * ``rte_eth_get_wake_addr()``
>> +  * add new eth_dev_ops ``get_wake_addr``
>> +
>>   * **Updated Broadcom bnxt driver.**
>>
>>     Updated the Broadcom bnxt driver with new features and improvements,
>> including:
>> @@ -136,6 +141,17 @@ New Features
>>     * Extern objects and functions can be plugged into the pipeline.
>>     * Transaction-oriented table updates.
>>
>> +* **Add PMD power management mechanism**
>> +
>> +  3 new Ethernet PMD power management mechanism is added through
> 
> " mechanisms are " please.
> 
>> + existing  RX callback infrastructure.
>> +
>> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
>> +  * Add power saving scheme based on ``rte_pause()``
>> +  * Add power saving scheme based on frequency scaling through the
>> + power library
>> +  * Add new EXPERIMENTAL API
>> ``rte_power_pmd_mgmt_queue_enable()``
>> +  * Add new EXPERIMENTAL API
>> ``rte_power_pmd_mgmt_queue_disable()``
>> +
> 
> Could this doc change be moved to the specific patch it belongs to, if it is not related to this patch?

It is related - it's the doc changes that add mention of this API. I was 
under the impression current policy was having doc updates in the same 
patch as the changes made?

> 
>>
>>   Removed Items
>>   -------------
>> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
>> index 48d1333b17..352108f43c 100644
>> --- a/lib/librte_ethdev/rte_ethdev.c
>> +++ b/lib/librte_ethdev/rte_ethdev.c
>> @@ -4804,6 +4804,23 @@ rte_eth_tx_burst_mode_get(uint16_t port_id,
>> uint16_t queue_id,
>>   		       dev->dev_ops->tx_burst_mode_get(dev, queue_id,
>> mode));  }
>>
>> +int
>> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
>> +		volatile void **wake_addr, uint64_t *expected, uint64_t
>> *mask) {
>> +	struct rte_eth_dev *dev;
>> +
>> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
>> +
>> +	dev = &rte_eth_devices[port_id];
>> +
>> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -
>> ENOTSUP);
>> +
>> +	return eth_err(port_id,
>> +		dev->dev_ops->get_wake_addr(dev->data-
>>> rx_queues[queue_id],
>> +			wake_addr, expected, mask));
>> +}
>> +
>>   int
>>   rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>>   			     struct rte_ether_addr *mc_addr_set, diff --git
>> a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h index
>> d2bf74f128..a6cfe3cd57 100644
>> --- a/lib/librte_ethdev/rte_ethdev.h
>> +++ b/lib/librte_ethdev/rte_ethdev.h
>> @@ -4014,6 +4014,30 @@ __rte_experimental  int
>> rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>>   	struct rte_eth_burst_mode *mode);
>>
>> +/**
>> + * Retrieve the wake up address from specific queue
>> + *
>> + * @param port_id
>> + *   The port identifier of the Ethernet device.
>> + * @param queue_id
>> + *   The Tx queue on the Ethernet device for which information
>> + *   will be retrieved.
>> + * @param wake_addr
>> + *   The pointer point to the address which is used for monitoring.
>> + * @param expected
>> + *   The pointer point to value to be expected when descriptor is set.
>> + * @param mask
>> + *   The pointer point to comparison bitmask for the expected value.
>> + *
>> + * @return
>> + *   - 0: Success.
>> + *   -EINVAL: Failed to get wake address.
>> + */
> 
> Is "-EINVAL" the only error value that will be returned?

Also -ENOTSUP, i'll add this, thanks.

> 
>> +__rte_experimental
>> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
>> +			  volatile void **wake_addr,
>> +			  uint64_t *expected, uint64_t *mask);
>> +
>>   /**
>>    * Retrieve device registers and register attributes (number of registers and
>>    * register size)
>> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h
>> b/lib/librte_ethdev/rte_ethdev_driver.h
>> index c3062c246c..935d46f25c 100644
>> --- a/lib/librte_ethdev/rte_ethdev_driver.h
>> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
>> @@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>>   	 uint16_t nb_tx_desc,
>>   	 const struct rte_eth_hairpin_conf *hairpin_conf);
>>
>> +/**
>> + * @internal
>> + * Get the Wake up address.
>> + *
>> + * @param rxq
>> + *   Ethdev queue pointer.
>> + * @param tail_desc_addr
>> + *   The pointer point to descriptor address var.
>> + * @param expected
>> + *   The pointer point to value to be expected when descriptor is set.
>> + * @param mask
>> + *   The pointer point to comparison bitmask for the expected value.
>> + * @return
>> + *   Negative errno value on error, 0 on success.
>> + *
>> + * @retval 0
>> + *   Success.
>> + * @retval -EINVAL
>> + *   Failed to get descriptor address.
>> + */
> 
> The question is the same as above.

This is a driver function pointer, so return value will depend on driver 
implementation. So far we only see 0 or -EINVAL values from the driver 
itself, while -ENOTSUP will be returned by ethdev in case there is no 
driver implementation of this function. So, in this case this is correct.
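
As a rough sketch of that split (simplified, with hypothetical toy_* names
standing in for eth_dev_ops, the driver callback and the ethdev wrapper, and
leaving out port and queue validation), the wrapper reports -ENOTSUP itself and
otherwise passes through whatever the driver returns:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified version of the driver callback type. */
typedef int (*toy_get_wake_addr_t)(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask);

/* Hypothetical, simplified version of the per-driver ops table. */
struct toy_dev_ops {
	toy_get_wake_addr_t get_wake_addr; /* NULL if the driver lacks support */
};

static uint64_t toy_status; /* stands in for a descriptor status word */

/* Toy driver callback: always succeeds, like the i40e implementation. */
static inline int toy_drv_get_wake_addr(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask)
{
	(void)rxq;
	*addr = &toy_status;
	*expected = 1;
	*mask = 1;
	return 0;
}

/* Toy wrapper: -ENOTSUP comes from here, anything else from the driver. */
static inline int toy_get_wake_addr(const struct toy_dev_ops *ops, void *rxq,
		volatile void **addr, uint64_t *expected, uint64_t *mask)
{
	assert(ops != NULL);

	if (ops->get_wake_addr == NULL)
		return -ENOTSUP;

	return ops->get_wake_addr(rxq, addr, expected, mask);
}
```

With an ops table whose get_wake_addr is NULL the wrapper returns -ENOTSUP
without touching the output arguments; with the toy callback installed it
returns 0 and fills in the address, expected value and mask.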

> 
>> +typedef int (*eth_get_wake_addr_t)
>> +	(void *rxq, volatile void **tail_desc_addr,
>> +	 uint64_t *expected, uint64_t *mask);
>> +
>> +
>>   /**
>>    * @internal A structure containing the functions exported by an Ethernet
>> driver.
>>    */
>> @@ -713,6 +738,9 @@ struct eth_dev_ops {
>>   	/**< Set up device RX hairpin queue. */
>>   	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>>   	/**< Set up device TX hairpin queue. */
>> +	eth_get_wake_addr_t get_wake_addr;
>> +	/**< Get wake up address. */
>> +
>>   };
>>
>>   /**
>> diff --git a/lib/librte_ethdev/rte_ethdev_version.map
>> b/lib/librte_ethdev/rte_ethdev_version.map
>> index c95ef5157a..3cb2093980 100644
>> --- a/lib/librte_ethdev/rte_ethdev_version.map
>> +++ b/lib/librte_ethdev/rte_ethdev_version.map
>> @@ -229,6 +229,7 @@ EXPERIMENTAL {
>>   	# added in 20.11
>>   	rte_eth_link_speed_to_str;
>>   	rte_eth_link_to_str;
>> +	rte_eth_get_wake_addr;
>>   };
>>
>>   INTERNAL {
>> --
>> 2.17.1


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v5 07/10] net/i40e: implement power management API
  2020-10-14  3:19         ` Guo, Jia
@ 2020-10-14  9:08           ` Burakov, Anatoly
  2020-10-14  9:17             ` Guo, Jia
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-14  9:08 UTC (permalink / raw)
  To: Guo, Jia, dev
  Cc: Ma, Liang J, Xing, Beilei, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara, Chris

On 14-Oct-20 4:19 AM, Guo, Jia wrote:
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Saturday, October 10, 2020 12:02 AM
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Xing, Beilei <beilei.xing@intel.com>;
>> Guo, Jia <jia.guo@intel.com>; Hunt, David <david.hunt@intel.com>;
>> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
>> jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
>> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>;
>> Eads, Gage <gage.eads@intel.com>; Macnamara, Chris
>> <chris.macnamara@intel.com>
>> Subject: [PATCH v5 07/10] net/i40e: implement power management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Implement support for the power management API by implementing a
>> `get_wake_addr` function that will return an address of an RX ring's status bit.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   drivers/net/i40e/i40e_ethdev.c |  1 +
>>   drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
>>   drivers/net/i40e/i40e_rxtx.h   |  2 ++
>>   3 files changed, 26 insertions(+)
>>
>> diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
>> index 943cfe71dc..cab86f8ec9 100644
>> --- a/drivers/net/i40e/i40e_ethdev.c
>> +++ b/drivers/net/i40e/i40e_ethdev.c
>> @@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
>>   	.mtu_set                      = i40e_dev_mtu_set,
>>   	.tm_ops_get                   = i40e_tm_ops_get,
>>   	.tx_done_cleanup              = i40e_tx_done_cleanup,
>> +	.get_wake_addr	              = i40e_get_wake_addr,
>>   };
>>
>>   /* store statistics names and its offset in stats structure */ diff --git
>> a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index
>> 322fc1ed75..c17f27292f 100644
>> --- a/drivers/net/i40e/i40e_rxtx.c
>> +++ b/drivers/net/i40e/i40e_rxtx.c
>> @@ -71,6 +71,29 @@
>>   #define I40E_TX_OFFLOAD_NOTSUP_MASK \
>>   		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
>>
>> +int
>> +i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
>> +		uint64_t *expected, uint64_t *mask)
>> +{
>> +	struct i40e_rx_queue *rxq = rx_queue;
>> +	volatile union i40e_rx_desc *rxdp;
>> +	uint16_t desc;
>> +
>> +	desc = rxq->rx_tail;
>> +	rxdp = &rxq->rx_ring[desc];
>> +	/* watch for changes in status bit */
>> +	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
>> +
>> +	/*
>> +	 * we expect the DD bit to be set to 1 if this descriptor was already
>> +	 * written to.
>> +	 */
>> +	*expected = rte_cpu_to_le_64(1 <<
>> I40E_RX_DESC_STATUS_DD_SHIFT);
>> +	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
>> +
>> +	return 0;
> 
> I suppose getting the wake address will always succeed in i40e, right?

Yes. We've already checked all the parameters (queue etc.) in ethdev, so
once we're here, that means there's no way this could fail as far as I
can tell.
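
For illustration, the (address, expected, mask) triple returned by the
driver is meant to be consumed roughly as sketched below. This is only a
sketch under assumptions: `sketch_wake_condition_met` is a hypothetical
helper, not a function from this patchset, and the caller is assumed to
treat the monitored location as one 64-bit word.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patchset): returns true when the
 * monitored 64-bit word already matches the expected value under the mask,
 * i.e. the descriptor was already written and there is no point in sleeping.
 */
static bool
sketch_wake_condition_met(const volatile uint64_t *addr,
		uint64_t expected, uint64_t mask)
{
	/* only the bits selected by the mask take part in the comparison */
	return (*addr & mask) == expected;
}
```

With expected and mask both set to the DD bit, as in the i40e
implementation above, any other bits in the status word are ignored.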

> 
>> +}
>> +
>>   static inline void
>>   i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc
>> *rxdp)  { diff --git a/drivers/net/i40e/i40e_rxtx.h
>> b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..f23a2073e3 100644
>> --- a/drivers/net/i40e/i40e_rxtx.h
>> +++ b/drivers/net/i40e/i40e_rxtx.h
>> @@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void
>> *rx_queue,
>>   	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);  uint16_t
>> i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
>>   	uint16_t nb_pkts);
>> +int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
>> +		uint64_t *expected, uint64_t *value);
>>
>>   /* For each value it means, datasheet of hardware can tell more details
>>    *
>> --
>> 2.17.1


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-14  9:07           ` Burakov, Anatoly
@ 2020-10-14  9:15             ` Guo, Jia
  2020-10-14  9:30               ` Burakov, Anatoly
  2020-10-14  9:23             ` Bruce Richardson
  1 sibling, 1 reply; 421+ messages in thread
From: Guo, Jia @ 2020-10-14  9:15 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris


> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Wednesday, October 14, 2020 5:07 PM
> To: Guo, Jia <jia.guo@intel.com>; dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
> <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
> Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> Macnamara, Chris <chris.macnamara@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
> management API
> 
> On 14-Oct-20 4:10 AM, Guo, Jia wrote:
> >
> >> -----Original Message-----
> >> From: dev <dev-bounces@dpdk.org> On Behalf Of Anatoly Burakov
> >> Sent: Saturday, October 10, 2020 12:02 AM
> >> To: dev@dpdk.org
> >> Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
> >> <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
> >> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> >> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
> >> <david.hunt@intel.com>; Ananyev, Konstantin
> >> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
> >> Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
> >> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> >> Macnamara, Chris <chris.macnamara@intel.com>
> >> Subject: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
> >> management API
> >>
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Add a simple API to allow getting address of next RX descriptor from
> >> the PMD, as well as release notes information.
> >>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> 
> Hi Jia,
> 
> Thanks for your review. Responses below.
> 
> >>
> >> Notes:
> >>      v5:
> >>      - Bring function format in line with other functions in the file
> >>      - Ensure the API is supported by the driver before calling it
> >> (Konstantin)
> >>
> >>   doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
> >>   lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
> >>   lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
> >>   lib/librte_ethdev/rte_ethdev_driver.h    | 28
> ++++++++++++++++++++++++
> >>   lib/librte_ethdev/rte_ethdev_version.map |  1 +
> >>   5 files changed, 86 insertions(+)
> >>
> >> diff --git a/doc/guides/rel_notes/release_20_11.rst
> >> b/doc/guides/rel_notes/release_20_11.rst
> >> index 808bdc4e54..e85af5d3e9 100644
> >> --- a/doc/guides/rel_notes/release_20_11.rst
> >> +++ b/doc/guides/rel_notes/release_20_11.rst
> >> @@ -55,6 +55,11 @@ New Features
> >>        Also, make sure to start the actual text at the margin.
> >>
> =======================================================
> >>
> >> +* **ethdev: add 1 new EXPERIMENTAL API for PMD power
> >> management.**
> >> +
> >> +  * ``rte_eth_get_wake_addr()``
> >> +  * add new eth_dev_ops ``get_wake_addr``
> >> +
> >>   * **Updated Broadcom bnxt driver.**
> >>
> >>     Updated the Broadcom bnxt driver with new features and
> >> improvements,
> >> including:
> >> @@ -136,6 +141,17 @@ New Features
> >>     * Extern objects and functions can be plugged into the pipeline.
> >>     * Transaction-oriented table updates.
> >>
> >> +* **Add PMD power management mechanism**
> >> +
> >> +  3 new Ethernet PMD power management mechanism is added through
> >
> > " mechanisms are " please.
> >
> >> + existing  RX callback infrastructure.
> >> +
> >> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
> >> +  * Add power saving scheme based on ``rte_pause()``
> >> +  * Add power saving scheme based on frequency scaling through the
> >> + power library
> >> +  * Add new EXPERIMENTAL API
> >> ``rte_power_pmd_mgmt_queue_enable()``
> >> +  * Add new EXPERIMENTAL API
> >> ``rte_power_pmd_mgmt_queue_disable()``
> >> +
> >
> > Could this doc change be moved to a separate patch if it is not related to
> > this patch?
> 
> It is related - it's the doc changes that add mention of this API. I was under
> the impression current policy was having doc updates in the same patch as
> the changes made?
> 

Do you think this part would be better separated out into [PATCH v5 05/10]?

> >
> >>
> >>   Removed Items
> >>   -------------
> >> diff --git a/lib/librte_ethdev/rte_ethdev.c
> >> b/lib/librte_ethdev/rte_ethdev.c index 48d1333b17..352108f43c 100644
> >> --- a/lib/librte_ethdev/rte_ethdev.c
> >> +++ b/lib/librte_ethdev/rte_ethdev.c
> >> @@ -4804,6 +4804,23 @@ rte_eth_tx_burst_mode_get(uint16_t port_id,
> >> uint16_t queue_id,
> >>   		       dev->dev_ops->tx_burst_mode_get(dev, queue_id,
> mode));  }
> >>
> >> +int
> >> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> >> +		volatile void **wake_addr, uint64_t *expected, uint64_t
> >> *mask) {
> >> +	struct rte_eth_dev *dev;
> >> +
> >> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> >> +
> >> +	dev = &rte_eth_devices[port_id];
> >> +
> >> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -
> >> ENOTSUP);
> >> +
> >> +	return eth_err(port_id,
> >> +		dev->dev_ops->get_wake_addr(dev->data-
> >>> rx_queues[queue_id],
> >> +			wake_addr, expected, mask));
> >> +}
> >> +
> >>   int
> >>   rte_eth_dev_set_mc_addr_list(uint16_t port_id,
> >>   			     struct rte_ether_addr *mc_addr_set, diff --git
> >> a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> >> index
> >> d2bf74f128..a6cfe3cd57 100644
> >> --- a/lib/librte_ethdev/rte_ethdev.h
> >> +++ b/lib/librte_ethdev/rte_ethdev.h
> >> @@ -4014,6 +4014,30 @@ __rte_experimental  int
> >> rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
> >>   	struct rte_eth_burst_mode *mode);
> >>
> >> +/**
> >> + * Retrieve the wake up address from specific queue
> >> + *
> >> + * @param port_id
> >> + *   The port identifier of the Ethernet device.
> >> + * @param queue_id
> >> + *   The Tx queue on the Ethernet device for which information
> >> + *   will be retrieved.
> >> + * @param wake_addr
> >> + *   The pointer point to the address which is used for monitoring.
> >> + * @param expected
> >> + *   The pointer point to value to be expected when descriptor is set.
> >> + * @param mask
> >> + *   The pointer point to comparison bitmask for the expected value.
> >> + *
> >> + * @return
> >> + *   - 0: Success.
> >> + *   -EINVAL: Failed to get wake address.
> >> + */
> >
> > Is "-EINVAL" the only error value that will be returned?
> 
> Also -ENOTSUP, I'll add this, thanks.
> 
> >
> >> +__rte_experimental
> >> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> >> +			  volatile void **wake_addr,
> >> +			  uint64_t *expected, uint64_t *mask);
> >> +
> >>   /**
> >>    * Retrieve device registers and register attributes (number of registers
> and
> >>    * register size)
> >> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h
> >> b/lib/librte_ethdev/rte_ethdev_driver.h
> >> index c3062c246c..935d46f25c 100644
> >> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> >> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> >> @@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
> >>   	 uint16_t nb_tx_desc,
> >>   	 const struct rte_eth_hairpin_conf *hairpin_conf);
> >>
> >> +/**
> >> + * @internal
> >> + * Get the Wake up address.
> >> + *
> >> + * @param rxq
> >> + *   Ethdev queue pointer.
> >> + * @param tail_desc_addr
> >> + *   The pointer point to descriptor address var.
> >> + * @param expected
> >> + *   The pointer point to value to be expected when descriptor is set.
> >> + * @param mask
> >> + *   The pointer point to comparison bitmask for the expected value.
> >> + * @return
> >> + *   Negative errno value on error, 0 on success.
> >> + *
> >> + * @retval 0
> >> + *   Success.
> >> + * @retval -EINVAL
> >> + *   Failed to get descriptor address.
> >> + */
> >
> > The question is the same as above.
> 
> This is a driver function pointer, so return value will depend on driver
> implementation. So far we only see 0 or -EINVAL values from the driver itself,
> while -ENOTSUP will be returned by ethdev in case there is no driver
> implementation of this function. So, in this case this is correct.
> 

Ok.
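
To make the layering of return values concrete, here is a minimal sketch
of the dispatch pattern discussed above. It is not the ethdev code: the
`sketch_*` names and the simplified device structure are hypothetical
stand-ins, and a NULL device is used in place of an invalid port id.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver callback type. */
typedef int (*sketch_get_wake_addr_t)(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask);

/* Hypothetical, heavily simplified device: one callback, one queue. */
struct sketch_dev {
	sketch_get_wake_addr_t get_wake_addr; /* NULL if not implemented */
	void *rxq;
};

static uint64_t sketch_desc; /* pretend descriptor status word */

/* Hypothetical driver callback that cannot fail, like the i40e one. */
static int
sketch_drv_get_wake_addr(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask)
{
	(void)rxq;
	*addr = &sketch_desc;
	*expected = 1;
	*mask = 1;
	return 0;
}

/* Generic layer: validates first, then forwards the driver's own result. */
static int
sketch_get_wake_addr(struct sketch_dev *dev, volatile void **addr,
		uint64_t *expected, uint64_t *mask)
{
	if (dev == NULL)
		return -ENODEV;  /* stands in for an invalid port id */
	if (dev->get_wake_addr == NULL)
		return -ENOTSUP; /* driver does not implement the callback */
	return dev->get_wake_addr(dev->rxq, addr, expected, mask);
}
```

In this shape -ENODEV and -ENOTSUP can only come from the generic layer,
while 0 or a driver-specific error comes from the callback itself.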

> >
> >> +typedef int (*eth_get_wake_addr_t)
> >> +	(void *rxq, volatile void **tail_desc_addr,
> >> +	 uint64_t *expected, uint64_t *mask);
> >> +
> >> +
> >>   /**
> >>    * @internal A structure containing the functions exported by an
> Ethernet
> >> driver.
> >>    */
> >> @@ -713,6 +738,9 @@ struct eth_dev_ops {
> >>   	/**< Set up device RX hairpin queue. */
> >>   	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
> >>   	/**< Set up device TX hairpin queue. */
> >> +	eth_get_wake_addr_t get_wake_addr;
> >> +	/**< Get wake up address. */
> >> +
> >>   };
> >>
> >>   /**
> >> diff --git a/lib/librte_ethdev/rte_ethdev_version.map
> >> b/lib/librte_ethdev/rte_ethdev_version.map
> >> index c95ef5157a..3cb2093980 100644
> >> --- a/lib/librte_ethdev/rte_ethdev_version.map
> >> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> >> @@ -229,6 +229,7 @@ EXPERIMENTAL {
> >>   	# added in 20.11
> >>   	rte_eth_link_speed_to_str;
> >>   	rte_eth_link_to_str;
> >> +	rte_eth_get_wake_addr;
> >>   };
> >>
> >>   INTERNAL {
> >> --
> >> 2.17.1
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 07/10] net/i40e: implement power management API
  2020-10-14  9:08           ` Burakov, Anatoly
@ 2020-10-14  9:17             ` Guo, Jia
  0 siblings, 0 replies; 421+ messages in thread
From: Guo, Jia @ 2020-10-14  9:17 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Xing, Beilei, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris


> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Wednesday, October 14, 2020 5:08 PM
> To: Guo, Jia <jia.guo@intel.com>; dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Xing, Beilei <beilei.xing@intel.com>;
> Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
> Bruce <bruce.richardson@intel.com>; thomas@monjalon.net; McDaniel,
> Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> <gage.eads@intel.com>; Macnamara, Chris <chris.macnamara@intel.com>
> Subject: Re: [PATCH v5 07/10] net/i40e: implement power management API
> 
> On 14-Oct-20 4:19 AM, Guo, Jia wrote:
> >
> >> -----Original Message-----
> >> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> >> Sent: Saturday, October 10, 2020 12:02 AM
> >> To: dev@dpdk.org
> >> Cc: Ma, Liang J <liang.j.ma@intel.com>; Xing, Beilei
> >> <beilei.xing@intel.com>; Guo, Jia <jia.guo@intel.com>; Hunt, David
> >> <david.hunt@intel.com>; Ananyev, Konstantin
> >> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
> >> Bruce <bruce.richardson@intel.com>; thomas@monjalon.net; McDaniel,
> >> Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> >> <gage.eads@intel.com>; Macnamara, Chris <chris.macnamara@intel.com>
> >> Subject: [PATCH v5 07/10] net/i40e: implement power management API
> >>
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Implement support for the power management API by implementing a
> >> `get_wake_addr` function that will return an address of an RX ring's status
> bit.
> >>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>   drivers/net/i40e/i40e_ethdev.c |  1 +
> >>   drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
> >>   drivers/net/i40e/i40e_rxtx.h   |  2 ++
> >>   3 files changed, 26 insertions(+)
> >>
> >> diff --git a/drivers/net/i40e/i40e_ethdev.c
> >> b/drivers/net/i40e/i40e_ethdev.c index 943cfe71dc..cab86f8ec9 100644
> >> --- a/drivers/net/i40e/i40e_ethdev.c
> >> +++ b/drivers/net/i40e/i40e_ethdev.c
> >> @@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops
> = {
> >>   	.mtu_set                      = i40e_dev_mtu_set,
> >>   	.tm_ops_get                   = i40e_tm_ops_get,
> >>   	.tx_done_cleanup              = i40e_tx_done_cleanup,
> >> +	.get_wake_addr	              = i40e_get_wake_addr,
> >>   };
> >>
> >>   /* store statistics names and its offset in stats structure */ diff
> >> --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> >> index 322fc1ed75..c17f27292f 100644
> >> --- a/drivers/net/i40e/i40e_rxtx.c
> >> +++ b/drivers/net/i40e/i40e_rxtx.c
> >> @@ -71,6 +71,29 @@
> >>   #define I40E_TX_OFFLOAD_NOTSUP_MASK \
> >>   		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
> >>
> >> +int
> >> +i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> >> +		uint64_t *expected, uint64_t *mask) {
> >> +	struct i40e_rx_queue *rxq = rx_queue;
> >> +	volatile union i40e_rx_desc *rxdp;
> >> +	uint16_t desc;
> >> +
> >> +	desc = rxq->rx_tail;
> >> +	rxdp = &rxq->rx_ring[desc];
> >> +	/* watch for changes in status bit */
> >> +	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
> >> +
> >> +	/*
> >> +	 * we expect the DD bit to be set to 1 if this descriptor was already
> >> +	 * written to.
> >> +	 */
> >> +	*expected = rte_cpu_to_le_64(1 <<
> >> I40E_RX_DESC_STATUS_DD_SHIFT);
> >> +	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> >> +
> >> +	return 0;
> >
> > I suppose getting the wake address will always succeed in i40e, right?
> 
> Yes. We've already checked all the parameters (queue etc.) in ethdev, so
> once we're here, that means there's no way this could fail as far as I can tell.
> 

Ok. 
Acked-by: Jeff Guo <jia.guo@intel.com>

> >
> >> +}
> >> +
> >>   static inline void
> >>   i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union
> >> i40e_rx_desc
> >> *rxdp)  { diff --git a/drivers/net/i40e/i40e_rxtx.h
> >> b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..f23a2073e3 100644
> >> --- a/drivers/net/i40e/i40e_rxtx.h
> >> +++ b/drivers/net/i40e/i40e_rxtx.h
> >> @@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void
> >> *rx_queue,
> >>   	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);  uint16_t
> >> i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
> >>   	uint16_t nb_pkts);
> >> +int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> >> +		uint64_t *expected, uint64_t *value);
> >>
> >>   /* For each value it means, datasheet of hardware can tell more details
> >>    *
> >> --
> >> 2.17.1
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-14  9:07           ` Burakov, Anatoly
  2020-10-14  9:15             ` Guo, Jia
@ 2020-10-14  9:23             ` Bruce Richardson
  1 sibling, 0 replies; 421+ messages in thread
From: Bruce Richardson @ 2020-10-14  9:23 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Guo, Jia, dev, Ma, Liang J, Thomas Monjalon, Yigit, Ferruh,
	Andrew Rybchenko, Ray Kinsella, Neil Horman, Hunt, David,
	Ananyev, Konstantin, jerinjacobk, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris

On Wed, Oct 14, 2020 at 10:07:09AM +0100, Burakov, Anatoly wrote:
> On 14-Oct-20 4:10 AM, Guo, Jia wrote:
> > 
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Anatoly Burakov
> > > Sent: Saturday, October 10, 2020 12:02 AM
> > > To: dev@dpdk.org
> > > Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
> > > <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
> > > Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> > > <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
> > > <david.hunt@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
> > > Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
> > > <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> > > Macnamara, Chris <chris.macnamara@intel.com>
> > > Subject: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
> > > management API
> > > 
> > > From: Liang Ma <liang.j.ma@intel.com>
> > > 
> > > Add a simple API to allow getting address of next RX descriptor from the
> > > PMD, as well as release notes information.
> > > 
> > > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > > ---
> 
> Hi Jia,
> 
> Thanks for your review. Responses below.
> 
> > > 
> > > Notes:
> > >      v5:
> > >      - Bring function format in line with other functions in the file
> > >      - Ensure the API is supported by the driver before calling it (Konstantin)
> > > 
> > >   doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
> > >   lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
> > >   lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
> > >   lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
> > >   lib/librte_ethdev/rte_ethdev_version.map |  1 +
> > >   5 files changed, 86 insertions(+)
> > > 
> > > diff --git a/doc/guides/rel_notes/release_20_11.rst
> > > b/doc/guides/rel_notes/release_20_11.rst
> > > index 808bdc4e54..e85af5d3e9 100644
> > > --- a/doc/guides/rel_notes/release_20_11.rst
> > > +++ b/doc/guides/rel_notes/release_20_11.rst
> > > @@ -55,6 +55,11 @@ New Features
> > >        Also, make sure to start the actual text at the margin.
> > >        =======================================================
> > > 
> > > +* **ethdev: add 1 new EXPERIMENTAL API for PMD power
> > > management.**
> > > +
> > > +  * ``rte_eth_get_wake_addr()``
> > > +  * add new eth_dev_ops ``get_wake_addr``
> > > +
> > >   * **Updated Broadcom bnxt driver.**
> > > 
> > >     Updated the Broadcom bnxt driver with new features and improvements,
> > > including:
> > > @@ -136,6 +141,17 @@ New Features
> > >     * Extern objects and functions can be plugged into the pipeline.
> > >     * Transaction-oriented table updates.
> > > 
> > > +* **Add PMD power management mechanism**
> > > +
> > > +  3 new Ethernet PMD power management mechanism is added through
> > 
> > " mechanisms are " please.
> > 
> > > + existing  RX callback infrastructure.
> > > +
> > > +  * Add power saving scheme based on UMWAIT instruction (x86 only)
> > > +  * Add power saving scheme based on ``rte_pause()``
> > > +  * Add power saving scheme based on frequency scaling through the
> > > + power library
> > > +  * Add new EXPERIMENTAL API
> > > ``rte_power_pmd_mgmt_queue_enable()``
> > > +  * Add new EXPERIMENTAL API
> > > ``rte_power_pmd_mgmt_queue_disable()``
> > > +
> > 
> > Could this doc change be moved to a separate patch if it is not related to this patch?
> 
> It is related - it's the doc changes that add mention of this API. I was
> under the impression current policy was having doc updates in the same patch
> as the changes made?
> 

Yes, that is the case. Doc changes should be made alongside the relevant
code changes.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-14  9:15             ` Guo, Jia
@ 2020-10-14  9:30               ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-14  9:30 UTC (permalink / raw)
  To: Guo, Jia, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris

On 14-Oct-20 10:15 AM, Guo, Jia wrote:
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Wednesday, October 14, 2020 5:07 PM
>> To: Guo, Jia <jia.guo@intel.com>; dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
>> <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
>> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
>> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
>> <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
>> Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
>> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
>> management API
>>
>> On 14-Oct-20 4:10 AM, Guo, Jia wrote:
>>>
>>>> -----Original Message-----
>>>> From: dev <dev-bounces@dpdk.org> On Behalf Of Anatoly Burakov
>>>> Sent: Saturday, October 10, 2020 12:02 AM
>>>> To: dev@dpdk.org
>>>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
>>>> <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
>>>> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
>>>> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
>>>> <david.hunt@intel.com>; Ananyev, Konstantin
>>>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
>>>> Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
>>>> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>>>> Macnamara, Chris <chris.macnamara@intel.com>
>>>> Subject: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
>>>> management API
>>>>
>>>> From: Liang Ma <liang.j.ma@intel.com>
>>>>
>>>> Add a simple API to allow getting address of next RX descriptor from
>>>> the PMD, as well as release notes information.
>>>>
>>>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> ---
>>
>> Hi Jia,
>>
>> Thanks for your review. Responses below.
>>
>>>>
>>>> Notes:
>>>>       v5:
>>>>       - Bring function format in line with other functions in the file
>>>>       - Ensure the API is supported by the driver before calling it
>>>> (Konstantin)
>>>>
>>>>    doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
>>>>    lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
>>>>    lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>>>>    lib/librte_ethdev/rte_ethdev_driver.h    | 28
>> ++++++++++++++++++++++++
>>>>    lib/librte_ethdev/rte_ethdev_version.map |  1 +
>>>>    5 files changed, 86 insertions(+)
>>>>
>>>> diff --git a/doc/guides/rel_notes/release_20_11.rst
>>>> b/doc/guides/rel_notes/release_20_11.rst
>>>> index 808bdc4e54..e85af5d3e9 100644
>>>> --- a/doc/guides/rel_notes/release_20_11.rst
>>>> +++ b/doc/guides/rel_notes/release_20_11.rst
>>>> @@ -55,6 +55,11 @@ New Features
>>>>         Also, make sure to start the actual text at the margin.
>>>>
>> =======================================================
>>>>
>>>> +* **ethdev: add 1 new EXPERIMENTAL API for PMD power
>>>> management.**
>>>> +
>>>> +  * ``rte_eth_get_wake_addr()``
>>>> +  * add new eth_dev_ops ``get_wake_addr``
>>>> +
>>>>    * **Updated Broadcom bnxt driver.**
>>>>
>>>>      Updated the Broadcom bnxt driver with new features and
>>>> improvements,
>>>> including:
>>>> @@ -136,6 +141,17 @@ New Features
>>>>      * Extern objects and functions can be plugged into the pipeline.
>>>>      * Transaction-oriented table updates.
>>>>
>>>> +* **Add PMD power management mechanism**
>>>> +
>>>> +  3 new Ethernet PMD power management mechanism is added through
>>>
>>> " mechanisms are " please.
>>>
>>>> + existing  RX callback infrastructure.
>>>> +
>>>> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
>>>> +  * Add power saving scheme based on ``rte_pause()``
>>>> +  * Add power saving scheme based on frequency scaling through the
>>>> + power library
>>>> +  * Add new EXPERIMENTAL API
>>>> ``rte_power_pmd_mgmt_queue_enable()``
>>>> +  * Add new EXPERIMENTAL API
>>>> ``rte_power_pmd_mgmt_queue_disable()``
>>>> +
>>>
>>> Could this doc change be moved to a separate patch if it is not related to
>>> this patch?
>>
>> It is related - it's the doc changes that add mention of this API. I was under
>> the impression current policy was having doc updates in the same patch as
>> the changes made?
>>
> 
> Do you think this part would be better separated out into [PATCH v5 05/10]?

Oh, sorry, you're right, these aren't related, as this functionality 
isn't in this patch yet. Will fix.

> 
>>>
>>>>
>>>>    Removed Items
>>>>    -------------
>>>> diff --git a/lib/librte_ethdev/rte_ethdev.c
>>>> b/lib/librte_ethdev/rte_ethdev.c index 48d1333b17..352108f43c 100644
>>>> --- a/lib/librte_ethdev/rte_ethdev.c
>>>> +++ b/lib/librte_ethdev/rte_ethdev.c
>>>> @@ -4804,6 +4804,23 @@ rte_eth_tx_burst_mode_get(uint16_t port_id,
>>>> uint16_t queue_id,
>>>>    		       dev->dev_ops->tx_burst_mode_get(dev, queue_id,
>> mode));  }
>>>>
>>>> +int
>>>> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
>>>> +		volatile void **wake_addr, uint64_t *expected, uint64_t
>>>> *mask) {
>>>> +	struct rte_eth_dev *dev;
>>>> +
>>>> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
>>>> +
>>>> +	dev = &rte_eth_devices[port_id];
>>>> +
>>>> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -
>>>> ENOTSUP);
>>>> +
>>>> +	return eth_err(port_id,
>>>> +		dev->dev_ops->get_wake_addr(dev->data-
>>>>> rx_queues[queue_id],
>>>> +			wake_addr, expected, mask));
>>>> +}
>>>> +
>>>>    int
>>>>    rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>>>>    			     struct rte_ether_addr *mc_addr_set, diff --git
>>>> a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
>>>> index
>>>> d2bf74f128..a6cfe3cd57 100644
>>>> --- a/lib/librte_ethdev/rte_ethdev.h
>>>> +++ b/lib/librte_ethdev/rte_ethdev.h
>>>> @@ -4014,6 +4014,30 @@ __rte_experimental  int
>>>> rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>>>>    	struct rte_eth_burst_mode *mode);
>>>>
>>>> +/**
>>>> + * Retrieve the wake up address from specific queue
>>>> + *
>>>> + * @param port_id
>>>> + *   The port identifier of the Ethernet device.
>>>> + * @param queue_id
>>>> + *   The Tx queue on the Ethernet device for which information
>>>> + *   will be retrieved.
>>>> + * @param wake_addr
>>>> + *   The pointer point to the address which is used for monitoring.
>>>> + * @param expected
>>>> + *   The pointer point to value to be expected when descriptor is set.
>>>> + * @param mask
>>>> + *   The pointer point to comparison bitmask for the expected value.
>>>> + *
>>>> + * @return
>>>> + *   - 0: Success.
>>>> + *   -EINVAL: Failed to get wake address.
>>>> + */
>>>
>>> Is "-EINVAL" the only error value that will be returned?
>>
>> Also -ENOTSUP, I'll add this, thanks.
>>
>>>
>>>> +__rte_experimental
>>>> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
>>>> +			  volatile void **wake_addr,
>>>> +			  uint64_t *expected, uint64_t *mask);
>>>> +
>>>>    /**
>>>>     * Retrieve device registers and register attributes (number of registers
>> and
>>>>     * register size)
>>>> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h
>>>> b/lib/librte_ethdev/rte_ethdev_driver.h
>>>> index c3062c246c..935d46f25c 100644
>>>> --- a/lib/librte_ethdev/rte_ethdev_driver.h
>>>> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
>>>> @@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>>>>    	 uint16_t nb_tx_desc,
>>>>    	 const struct rte_eth_hairpin_conf *hairpin_conf);
>>>>
>>>> +/**
>>>> + * @internal
>>>> + * Get the Wake up address.
>>>> + *
>>>> + * @param rxq
>>>> + *   Ethdev queue pointer.
>>>> + * @param tail_desc_addr
>>>> + *   The pointer point to descriptor address var.
>>>> + * @param expected
>>>> + *   The pointer point to value to be expected when descriptor is set.
>>>> + * @param mask
>>>> + *   The pointer point to comparison bitmask for the expected value.
>>>> + * @return
>>>> + *   Negative errno value on error, 0 on success.
>>>> + *
>>>> + * @retval 0
>>>> + *   Success.
>>>> + * @retval -EINVAL
>>>> + *   Failed to get descriptor address.
>>>> + */
>>>
>>> The question is the same as above.
>>
>> This is a driver function pointer, so return value will depend on driver
>> implementation. So far we only see 0 or -EINVAL values from the driver itself,
>> while -ENOTSUP will be returned by ethdev in case there is no driver
>> implementation of this function. So, in this case this is correct.
>>
> 
> Ok.
> 
>>>
>>>> +typedef int (*eth_get_wake_addr_t)
>>>> +	(void *rxq, volatile void **tail_desc_addr,
>>>> +	 uint64_t *expected, uint64_t *mask);
>>>> +
>>>> +
>>>>    /**
>>>>     * @internal A structure containing the functions exported by an
>> Ethernet
>>>> driver.
>>>>     */
>>>> @@ -713,6 +738,9 @@ struct eth_dev_ops {
>>>>    	/**< Set up device RX hairpin queue. */
>>>>    	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>>>>    	/**< Set up device TX hairpin queue. */
>>>> +	eth_get_wake_addr_t get_wake_addr;
>>>> +	/**< Get wake up address. */
>>>> +
>>>>    };
>>>>
>>>>    /**
>>>> diff --git a/lib/librte_ethdev/rte_ethdev_version.map
>>>> b/lib/librte_ethdev/rte_ethdev_version.map
>>>> index c95ef5157a..3cb2093980 100644
>>>> --- a/lib/librte_ethdev/rte_ethdev_version.map
>>>> +++ b/lib/librte_ethdev/rte_ethdev_version.map
>>>> @@ -229,6 +229,7 @@ EXPERIMENTAL {
>>>>    	# added in 20.11
>>>>    	rte_eth_link_speed_to_str;
>>>>    	rte_eth_link_to_str;
>>>> +	rte_eth_get_wake_addr;
>>>>    };
>>>>
>>>>    INTERNAL {
>>>> --
>>>> 2.17.1
>>
>>
>> --
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly

* [dpdk-dev] [PATCH v6 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
  2020-10-09 17:06         ` Burakov, Anatoly
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
                             ` (9 more replies)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 02/10] eal: add power management intrinsics Anatoly Burakov
                           ` (8 subsequent siblings)
  10 siblings, 10 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Bruce Richardson, Konstantin Ananyev, david.hunt,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a new CPUID flag indicating processor support for the UMONITOR/UMWAIT
and TPAUSE instructions.
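
For illustration only (not part of this patch): the feature table entry
added below places WAITPKG at CPUID leaf 7, subleaf 0, ECX bit 5. A minimal
standalone sketch of the same check, assuming the GCC/Clang <cpuid.h> helper
__get_cpuid_count() is available; the function name cpu_has_waitpkg() is
hypothetical and is not a DPDK API:

```c
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Hypothetical helper: returns 1 if CPUID reports WAITPKG
 * (leaf 7, subleaf 0, ECX bit 5), and 0 otherwise.
 */
int cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	/* __get_cpuid_count() returns 0 when the leaf is not supported */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;

	return (ecx >> 5) & 1;
#else
	/* not an x86 CPU, so the instructions cannot be present */
	return 0;
#endif
}

/* Hypothetical helper: print the result in human-readable form. */
void print_waitpkg_support(void)
{
	printf("WAITPKG: %s\n", cpu_has_waitpkg() ? "yes" : "no");
}
```

Within DPDK itself, the equivalent check would presumably go through
rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG) once this patch is applied.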

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v6:
    - Fix typos
    - Better commit message

 lib/librte_eal/x86/include/rte_cpuflags.h | 1 +
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..848ba9cbfb 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1

* [dpdk-dev] [PATCH v6 02/10] eal: add power management intrinsics
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
  2020-10-09 17:06         ` Burakov, Anatoly
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 17:48           ` Ananyev, Konstantin
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
                           ` (7 subsequent siblings)
  10 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jan Viktorin, Ruifeng Wang, David Christensen,
	Bruce Richardson, Konstantin Ananyev, david.hunt, jerinjacobk,
	thomas, timothy.mcdaniel, gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add three new power management intrinsics, and provide an implementation
in eal/x86 based on the UMONITOR/UMWAIT and TPAUSE instructions. The
instructions are emitted as raw byte opcodes because there is not yet
widespread compiler support for them.

The power management intrinsics provide architecture-specific functions
to wait either until a specified TSC timestamp is reached, or until
either a TSC timestamp is reached or a memory location is written to.
The monitor functions also provide an optional comparison, to avoid
sleeping when the expected write has already happened and no more writes
are expected.

For more details, please refer to Intel(R) 64 and IA-32 Architectures
Software Developer's Manual, Volume 2.
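
For illustration only (not part of this patch): the optional comparison
described above can be sketched in portable C, without the UMONITOR/UMWAIT
opcodes. The helper names read_monitored() and write_already_happened() are
hypothetical; the logic mirrors the masked compare performed before the
wait, under the assumption that a zero mask disables the check:

```c
#include <stdint.h>

/* Hypothetical helper: read 1, 2, 4 or 8 bytes from the monitored address. */
uint64_t read_monitored(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case 1:
		return *(const volatile uint8_t *)p;
	case 2:
		return *(const volatile uint16_t *)p;
	case 4:
		return *(const volatile uint32_t *)p;
	case 8:
		return *(const volatile uint64_t *)p;
	default:
		/* unsupported size: nothing sensible to read */
		return 0;
	}
}

/*
 * Hypothetical helper: returns 1 if the sleep should be skipped because the
 * masked value at `p` already equals `expected`. A zero mask disables the
 * check, in which case the caller would always go on to wait.
 */
int write_already_happened(const volatile void *p, uint64_t expected,
		uint64_t mask, uint8_t sz)
{
	if (mask == 0)
		return 0;

	return (read_monitored(p, sz) & mask) == expected;
}
```

A caller would arm the monitor first and only then run this check, so that
a write landing between the check and the wait is not missed.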

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---

Notes:
    v6:
    - Add spinlock-enabled version to allow pthread-wait-like
      constructs with umwait
    - Clarify comments
    - Added experimental tags to intrinsics
    - Added endianness support
    v5:
    - Removed return values
    - Simplified intrinsics and hardcoded C0.2 state
    - Added other arch stubs

 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  58 ++++++++
 .../include/generic/rte_power_intrinsics.h    | 111 +++++++++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  58 ++++++++
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 132 ++++++++++++++++++
 8 files changed, 363 insertions(+)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
index 73b750a18f..c6a9f70d73 100644
--- a/lib/librte_eal/arm/include/meson.build
+++ b/lib/librte_eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
 	'rte_pause_32.h',
 	'rte_pause_64.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch_32.h',
 	'rte_prefetch_64.h',
 	'rte_prefetch.h',
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..b04ba10c76
--- /dev/null
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_ARM_H_
+#define _RTE_POWER_INTRINSIC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void rte_power_monitor_sync(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz,
+		rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..f9522f2776
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ */
+__rte_experimental
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This call will also unlock a spinlock on entering sleep, and lock it again
+ * on waking up the CPU.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ * @param lck
+ *   A spinlock that must be locked before entering the function, will be
+ *   unlocked while the CPU is sleeping, and will be locked again once the CPU
+ *   wakes up.
+ */
+__rte_experimental
+static inline void rte_power_monitor_sync(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz,
+		rte_spinlock_t *lck);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ */
+__rte_experimental
+static inline void rte_power_pause(const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
index ab4bd28092..0873b2aecb 100644
--- a/lib/librte_eal/ppc/include/meson.build
+++ b/lib/librte_eal/ppc/include/meson.build
@@ -10,6 +10,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch.h',
 	'rte_rwlock.h',
 	'rte_spinlock.h',
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..3bceefdc3f
--- /dev/null
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_PPC_H_
+#define _RTE_POWER_INTRINSIC_PPC_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void rte_power_monitor_sync(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz,
+		rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..9ac8e6eef6
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,132 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+static inline uint64_t __get_umwait_val(const volatile void *p,
+		const uint8_t sz)
+{
+	switch (sz) {
+	case 1:
+		return *(const volatile uint8_t *)p;
+	case 2:
+		return *(const volatile uint16_t *)p;
+	case 4:
+		return *(const volatile uint32_t *)p;
+	case 8:
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void rte_power_monitor_sync(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz,
+		rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses the TPAUSE instruction and will enter C0.2 state. For
+ * more information about usage of this instruction, please refer to Intel(R)
+ * 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
-- 
2.17.1

* [dpdk-dev] [PATCH v6 03/10] eal: add intrinsics support check infrastructure
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                           ` (2 preceding siblings ...)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 02/10] eal: add power management intrinsics Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 17:51           ` Ananyev, Konstantin
  2020-10-14 17:59           ` Jerin Jacob
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 04/10] ethdev: add simple power management API Anatoly Burakov
                           ` (6 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Ray Kinsella,
	Neil Horman, Bruce Richardson, Konstantin Ananyev, david.hunt,
	liang.j.ma, jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

Currently, it is not possible to check support for intrinsics that
are platform-specific, cannot be abstracted in a generic way, or do not
have support on all architectures. The CPUID flags can be used to some
extent, but they are only defined for their platform, while intrinsics
will be available to all code as they are in generic headers.

This patch introduces infrastructure to check support for certain
platform-specific intrinsics, and adds checks for the IA power
management intrinsics based on UMWAIT/UMONITOR and TPAUSE.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---

Notes:
    v6:
    - Fix the comments

 lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
 .../include/generic/rte_power_intrinsics.h    | 12 +++++++++
 lib/librte_eal/ppc/rte_cpuflags.c             |  7 +++++
 lib/librte_eal/rte_eal_version.map            |  1 +
 lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
 6 files changed, 64 insertions(+)

diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
index 7b257b7873..e3a53bcece 100644
--- a/lib/librte_eal/arm/rte_cpuflags.c
+++ b/lib/librte_eal/arm/rte_cpuflags.c
@@ -151,3 +151,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
index 872f0ebe3e..28a5aecde8 100644
--- a/lib/librte_eal/include/generic/rte_cpuflags.h
+++ b/lib/librte_eal/include/generic/rte_cpuflags.h
@@ -13,6 +13,32 @@
 #include "rte_common.h"
 #include <errno.h>
 
+#include <rte_compat.h>
+
+/**
+ * Structure used to describe platform-specific intrinsics that may or may not
+ * be supported at runtime.
+ */
+struct rte_cpu_intrinsics {
+	uint32_t power_monitor : 1;
+	/**< indicates support for rte_power_monitor function */
+	uint32_t power_pause : 1;
+	/**< indicates support for rte_power_pause function */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Check CPU support for various intrinsics at runtime.
+ *
+ * @param intrinsics
+ *     Pointer to a structure to be filled.
+ */
+__rte_experimental
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
+
 /**
  * Enumeration of all CPU features supported
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index f9522f2776..1c176e4ef5 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -32,6 +32,10 @@
  * checked against the expected value, and if they match, the entering of
  * optimized power state may be aborted.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
  * @param expected_value
@@ -69,6 +73,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * This call will also unlock a spinlock on entering sleep, and lock it again
  * on waking up the CPU.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
  * @param expected_value
@@ -101,6 +109,10 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
index 3bb7563ce9..61db5c216d 100644
--- a/lib/librte_eal/ppc/rte_cpuflags.c
+++ b/lib/librte_eal/ppc/rte_cpuflags.c
@@ -8,6 +8,7 @@
 #include <elf.h>
 #include <fcntl.h>
 #include <assert.h>
+#include <string.h>
 #include <unistd.h>
 
 /* Symbolic values for the entries in the auxiliary table */
@@ -108,3 +109,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index a93dea9fe6..ed944f2bd4 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -400,6 +400,7 @@ EXPERIMENTAL {
 	# added in 20.11
 	__rte_eal_trace_generic_size_t;
 	rte_service_lcore_may_be_active;
+	rte_cpu_get_intrinsics_support;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 0325c4b93b..a96312ff7f 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -7,6 +7,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <stdint.h>
+#include <string.h>
 
 #include "rte_cpuid.h"
 
@@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+		intrinsics->power_monitor = 1;
+		intrinsics->power_pause = 1;
+	}
+}
-- 
2.17.1

* [dpdk-dev] [PATCH v6 04/10] ethdev: add simple power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                           ` (3 preceding siblings ...)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 17:06           ` Ananyev, Konstantin
  2020-10-15 11:29           ` Liang, Ma
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback Anatoly Burakov
                           ` (5 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to allow getting the address of the next RX descriptor
from the PMD, as well as release notes information.
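
For illustration only (not part of this patch): a driver-side callback
matching the eth_get_wake_addr_t signature added below might look roughly
like the following sketch. The descriptor layout, the "descriptor done" bit
and all fake_* names are hypothetical and do not correspond to any real PMD:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical "descriptor done" bit, set by the NIC on write-back */
#define FAKE_RXD_STAT_DD UINT64_C(0x1)

/* hypothetical RX descriptor layout */
struct fake_rx_desc {
	uint64_t buf_addr;
	uint64_t status;
};

/* hypothetical RX queue state */
struct fake_rxq {
	volatile struct fake_rx_desc *ring;
	uint16_t nb_desc;
	uint16_t rx_tail; /* index of the next descriptor to be filled */
};

/*
 * Report the status word of the next descriptor as the wake address, and
 * describe the value that indicates a packet has already arrived.
 */
int fake_get_wake_addr(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
{
	struct fake_rxq *q = rxq;

	if (q == NULL || q->ring == NULL || q->rx_tail >= q->nb_desc)
		return -EINVAL;

	*tail_desc_addr = &q->ring[q->rx_tail].status;
	*expected = FAKE_RXD_STAT_DD;
	*mask = FAKE_RXD_STAT_DD;
	*data_sz = sizeof(uint64_t);

	return 0;
}
```

The values returned this way are the ones a caller would then be expected
to pass on to rte_power_monitor() as p, expected_value, value_mask and
data_sz.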

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v6:
    - Rebase on top of latest main
    - Ensure the API checks queue ID (Konstantin)
    - Removed accidental inclusion of unrelated release notes
    v5:
    - Bring function format in line with other functions in the file
    - Ensure the API is supported by the driver before calling it (Konstantin)

 doc/guides/rel_notes/release_20_11.rst   |  8 ++++++-
 lib/librte_ethdev/rte_ethdev.c           | 23 +++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 26 ++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_version.map |  1 +
 5 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 0925123e9c..ca4f43f7f9 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -71,7 +71,13 @@ New Features
 * **Added the FEC API, for a generic FEC query and config.**
 
   Added the FEC API which provides functions for query FEC capabilities and
-  current FEC mode from device. Also, API for configuring FEC mode is also provided.
+  current FEC mode from device. Also, API for configuring FEC mode is also
+  provided.
+
+* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()``
+  * add new eth_dev_ops ``get_wake_addr``
 
 * **Updated Broadcom bnxt driver.**
 
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 59beb8aec2..5d19c97a3e 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -4844,6 +4844,29 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask, data_sz));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 3a31f94367..005faba455 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4119,6 +4119,32 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake up address for the receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param wake_addr
+ *   Pointer to where the address to be monitored will be stored.
+ * @param expected
+ *   Pointer to the value to be expected when the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ *
+ * @return
+ *   - 0: Success.
+ *   - (-ENOTSUP): Operation not supported.
+ *   - (-EINVAL): Invalid parameters.
+ *   - (-ENODEV): Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index 35cc4fb186..76b179de42 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -655,6 +655,32 @@ typedef int (*eth_fec_get_t)(struct rte_eth_dev *dev,
  */
 typedef int (*eth_fec_set_t)(struct rte_eth_dev *dev, uint32_t fec_capa);
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an RX queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to where the address to monitor will be stored.
+ * @param expected
+ *   Pointer to where the value expected once the descriptor is set is stored.
+ * @param mask
+ *   Pointer to where the comparison bitmask for the expected value is stored.
+ * @param data_sz
+ *   Data size for the expected value (can be 1, 2, 4, or 8 bytes)
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -801,6 +827,8 @@ struct eth_dev_ops {
 	/**< Get Forward Error Correction(FEC) mode. */
 	eth_fec_set_t fec_set;
 	/**< Set Forward Error Correction(FEC) mode. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get next RX queue ring entry address. */
 };
 
 /**
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index f8a0945812..6c2ea5996d 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -232,6 +232,7 @@ EXPERIMENTAL {
 	rte_eth_fec_get_capability;
 	rte_eth_fec_get;
 	rte_eth_fec_set;
+	rte_eth_get_wake_addr;
 };
 
 INTERNAL {
-- 
2.17.1


* [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                           ` (4 preceding siblings ...)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 04/10] ethdev: add simple power management API Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 14:19           ` David Hunt
  2020-10-14 18:41           ` Ananyev, Konstantin
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 06/10] net/ixgbe: implement power management API Anatoly Burakov
                           ` (4 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman,
	konstantin.ananyev, jerinjacobk, bruce.richardson, thomas,
	timothy.mcdaniel, gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).
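
As an illustration of the mapping constraint above, here is a minimal
sketch of a check an application could run over its own configuration
before enabling power management. The `rxq_map` structure and the
`one_queue_per_lcore()` helper are hypothetical names made up for this
example; they are not part of this patch or of DPDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (port, queue, lcore) triple, similar to l3fwd's --config. */
struct rxq_map {
	uint16_t port_id;
	uint16_t queue_id;
	unsigned int lcore_id;
};

/*
 * Return true if no lcore polls more than one RX queue, i.e. the
 * configuration satisfies the core-to-single-queue requirement.
 */
static bool
one_queue_per_lcore(const struct rxq_map *map, size_t n)
{
	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if (map[i].lcore_id == map[j].lcore_id)
				return false;
	return true;
}
```

With this helper, a configuration such as "(0,0,2),(0,1,3)" passes, while
one that assigns both queues to lcore 2 is rejected.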

This design uses PMD RX callbacks and offers three mechanisms:

1. UMWAIT/UMONITOR

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting for the address of the next
   RX descriptor to be written to.

2. Pause instruction

   Instead of moving the core into a deeper C-state, this method uses the
   pause instruction to avoid busy polling.

3. Frequency scaling

   Reuse the existing DPDK power library to scale the core frequency up
   or down depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch
    
    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check

 doc/guides/rel_notes/release_20_11.rst |  11 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 300 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 ++++++++
 lib/librte_power/rte_power_version.map |   4 +
 5 files changed, 410 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index ca4f43f7f9..06b822aa36 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -197,6 +197,17 @@ New Features
   * Added new ``RTE_ACL_CLASSIFY_AVX512X32`` vector implementation,
     which can process up to 32 flows in parallel. Requires AVX512 support.
 
+* **Add PMD power management mechanism**
+
+  Three new Ethernet PMD power management mechanisms are added through the
+  existing RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..2b7d2a1a46
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,300 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static void umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			int ret;
+			uint8_t data_sz;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = rte_eth_get_wake_addr(port_id, qidx,
+					&target_addr, &expected, &mask,
+					&data_sz);
+			if (ret == 0) {
+				/*
+				 * we need to ensure we can wake up by another
+				 * thread triggering a write, so we need the
+				 * address to always be up to date.
+				 */
+				rte_spinlock_lock(&q_conf->umwait_lock);
+				q_conf->wait_addr = target_addr;
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor_sync(target_addr, expected,
+						mask, -1ULL, data_sz,
+						&q_conf->umwait_lock);
+				/* erase the address */
+				q_conf->wait_addr = NULL;
+				rte_spinlock_unlock(&q_conf->umwait_lock);
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1


* [dpdk-dev] [PATCH v6 06/10] net/ixgbe: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                           ` (5 preceding siblings ...)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 17:04           ` Wang, Haiyue
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 07/10] net/i40e: " Anatoly Burakov
                           ` (3 subsequent siblings)
  10 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.
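
To show how the values returned by `get_wake_addr` fit together, here is
a minimal standalone sketch of the comparison a monitoring routine could
perform before going to sleep. It assumes the sleep is skipped when the
masked value at the monitored address already equals the expected value.
The helpers `read_sized()` and `wake_condition_met()` are hypothetical
names for this example and are not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Read a 1-, 2-, 4- or 8-byte value from addr, widened to 64 bits. */
static uint64_t
read_sized(const volatile void *addr, uint8_t data_sz)
{
	switch (data_sz) {
	case 1:
		return *(const volatile uint8_t *)addr;
	case 2:
		return *(const volatile uint16_t *)addr;
	case 4:
		return *(const volatile uint32_t *)addr;
	case 8:
		return *(const volatile uint64_t *)addr;
	default:
		assert(!"unsupported data size");
		return 0;
	}
}

/*
 * Return true if the descriptor has already been written, i.e. the bits
 * selected by mask match the expected value and there is no point in
 * sleeping.
 */
static bool
wake_condition_met(const volatile void *addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	return (read_sized(addr, data_sz) & mask) == expected;
}
```

For a 32-bit status word whose lowest bit plays the role of the DD bit
(an illustrative value), the check is false while the word is zero and
true once the NIC has written the descriptor back.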

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 0b98e210e7..30b3f416d4 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 29d385c062..b1d656d270 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the status field is 32 bits wide */
+	*data_sz = 4;
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 0b5589ef4d..6b9afec655 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


* [dpdk-dev] [PATCH v6 07/10] net/i40e: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                           ` (6 preceding siblings ...)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 06/10] net/ixgbe: implement power management API Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 08/10] net/ice: " Anatoly Burakov
                           ` (2 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 943cfe71dc..cab86f8ec9 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr                = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index f2844d3f74..cdb1cd494b 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,32 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* the status field is 64 bits wide */
+	*data_sz = 8;
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..5826cf1099 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


* [dpdk-dev] [PATCH v6 08/10] net/ice: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                           ` (7 preceding siblings ...)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 07/10] net/i40e: " Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 10/10] doc: update programmer's guide for power library Anatoly Burakov
  10 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d8ce09d28f..260de5dfd7 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr                = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 79e6df11f4..7c0f963d96 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -25,6 +25,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* the status field is 16 bits wide */
+	*data_sz = 2;
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..7eeb8d467e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1


* [dpdk-dev] [PATCH v6 09/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                           ` (8 preceding siblings ...)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 08/10] net/ice: " Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 14:24           ` David Hunt
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 10/10] doc: update programmer's guide for power library Anatoly Burakov
  10 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, konstantin.ananyev, jerinjacobk,
	bruce.richardson, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v6:
    - Fixed typos in documentation

 .../sample_app_ug/l3_forward_power_man.rst    | 13 ++++++
 examples/l3fwd-power/main.c                   | 41 ++++++++++++++++++-
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 0cc6f2e62e..2767fb91aa 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -459,3 +461,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD Power Management Mode
+-------------------------
+
+The PMD power management mode support for ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables the power saving scheme on a specific
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index d0e6c9bd77..af64dd521f 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf("PMD power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,12 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				rte_power_pmd_mgmt_queue_enable(lcore_id,
+							portid, queueid,
+						RTE_POWER_MGMT_TYPE_SCALE);
+
+			}
 		}
 	}
 
@@ -2790,6 +2812,9 @@ main(int argc, char **argv)
 						SKIP_MASTER);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MASTER);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MASTER);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2812,6 +2837,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1


* [dpdk-dev] [PATCH v6 10/10] doc: update programmer's guide for power library
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                           ` (9 preceding siblings ...)
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2020-10-14 13:30         ` Anatoly Burakov
  2020-10-14 14:27           ` David Hunt
  10 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-14 13:30 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, konstantin.ananyev, jerinjacobk,
	bruce.richardson, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Update programmer's guide to document PMD power management usage.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Moved l3fwd-power update to the l3fwd-power-related commit
    - Some rewordings and clarifications

 doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..38c64d31e4 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,45 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever the empty poll count reaches a certain number.
+
+  * UMWAIT/UMONITOR
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX descriptor
+   address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will use the `rte_pause` function to avoid busy
+   polling.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing power library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core
+
 References
 ----------
 
@@ -200,3 +239,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
-- 
2.17.1


* Re: [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback Anatoly Burakov
@ 2020-10-14 14:19           ` David Hunt
  2020-10-14 18:41           ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: David Hunt @ 2020-10-14 14:19 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, Ray Kinsella, Neil Horman, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara


On 14/10/2020 2:30 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
>
> This API mandates a core-to-single-queue mapping (that is, multiple
> queues per device are supported, but they have to be polled on different
> cores).
>
> This design is using PMD RX callbacks.
>
> 1. UMWAIT/UMONITOR:
>
>     When a certain threshold of empty polls is reached, the core will go
>     into a power optimized sleep while waiting on an address of next RX
>     descriptor to be written to.
>
> 2. Pause instruction
>
>     Instead of moving the core into a deeper C state, this method uses the
>     pause instruction to avoid busy polling.
>
> 3. Frequency scaling
>     Reuse existing DPDK power library to scale up/down core frequency
>     depending on traffic volume.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---


Hi Liang, Anatoly, great work on the patch set.


Acked-by: David Hunt <david.hunt@intel.com>
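
For readers following the thread, the empty-poll counting described in the
commit message above can be sketched roughly as below. This is only an
illustration, not the code from the patch: the threshold value, the struct
and the function name are all hypothetical, and the real callback would
arm the monitor or scale the frequency where this sketch just returns true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the actual value is not stated in this message. */
#define EMPTY_POLL_THRESHOLD 512

/* Hypothetical per-queue state kept by an RX callback. */
struct empty_poll_state {
	uint32_t empty_polls;
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns true when the caller should enter the power-saving state.
 */
static inline bool
empty_poll_update(struct empty_poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: reset the counter and keep polling */
		st->empty_polls = 0;
		return false;
	}
	/* saturate at the threshold so the counter cannot wrap */
	if (st->empty_polls < EMPTY_POLL_THRESHOLD)
		st->empty_polls++;
	return st->empty_polls >= EMPTY_POLL_THRESHOLD;
}
```

Resetting the counter on any non-empty burst is what keeps the core from
sleeping while traffic is flowing, as the cover letter describes.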






* Re: [dpdk-dev] [PATCH v6 09/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2020-10-14 14:24           ` David Hunt
  0 siblings, 0 replies; 421+ messages in thread
From: David Hunt @ 2020-10-14 14:24 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, konstantin.ananyev, jerinjacobk, bruce.richardson,
	thomas, timothy.mcdaniel, gage.eads, chris.macnamara


On 14/10/2020 2:30 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add PMD power management feature support to l3fwd-power sample app.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
>      v6:
>      - Fixed typos in documentation
>
>   .../sample_app_ug/l3_forward_power_man.rst    | 13 ++++++
>   examples/l3fwd-power/main.c                   | 41 ++++++++++++++++++-
>   2 files changed, 53 insertions(+), 1 deletion(-)
>

Acked-by: David Hunt <david.hunt@intel.com>




* Re: [dpdk-dev] [PATCH v6 10/10] doc: update programmer's guide for power library
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 10/10] doc: update programmer's guide for power library Anatoly Burakov
@ 2020-10-14 14:27           ` David Hunt
  0 siblings, 0 replies; 421+ messages in thread
From: David Hunt @ 2020-10-14 14:27 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, konstantin.ananyev, jerinjacobk, bruce.richardson,
	thomas, timothy.mcdaniel, gage.eads, chris.macnamara


On 14/10/2020 2:30 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
>
> Update programmer's guide to document PMD power management usage.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
>      v5:
>      - Moved l3fwd-power update to the l3fwd-power-related commit
>      - Some rewordings and clarifications
>
>   doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
>   1 file changed, 42 insertions(+)
>

Acked-by: David Hunt <david.hunt@intel.com>




* Re: [dpdk-dev] [PATCH v6 06/10] net/ixgbe: implement power management API
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 06/10] net/ixgbe: implement power management API Anatoly Burakov
@ 2020-10-14 17:04           ` Wang, Haiyue
  0 siblings, 0 replies; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-14 17:04 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Wednesday, October 14, 2020 21:30
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> Macnamara, Chris <chris.macnamara@intel.com>
> Subject: [PATCH v6 06/10] net/ixgbe: implement power management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>  3 files changed, 28 insertions(+)

Reviewed-by: Haiyue Wang <haiyue.wang@intel.com>

> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v6 04/10] ethdev: add simple power management API
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 04/10] ethdev: add simple power management API Anatoly Burakov
@ 2020-10-14 17:06           ` Ananyev, Konstantin
  2020-10-15 11:29           ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-14 17:06 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Hunt, David, jerinjacobk, Richardson,
	Bruce, McDaniel, Timothy, Eads, Gage, Macnamara,  Chris



> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple API to allow getting address of next RX descriptor from the
> PMD, as well as release notes information.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
> 
> Notes:
>     v6:
>     - Rebase on top of latest main
>     - Ensure the API checks queue ID (Konstantin)
>     - Removed accidental inclusion of unrelated release notes
>     v5:
>     - Bring function format in line with other functions in the file
>     - Ensure the API is supported by the driver before calling it (Konstantin)
> 
>  doc/guides/rel_notes/release_20_11.rst   |  8 ++++++-
>  lib/librte_ethdev/rte_ethdev.c           | 23 +++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev.h           | 26 ++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_version.map |  1 +
>  5 files changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index 0925123e9c..ca4f43f7f9 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -71,7 +71,13 @@ New Features
>  * **Added the FEC API, for a generic FEC query and config.**
> 
>    Added the FEC API which provides functions for query FEC capabilities and
> -  current FEC mode from device. Also, API for configuring FEC mode is also provided.
> +  current FEC mode from device. Also, API for configuring FEC mode is also
> +  provided.
> +
> +* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**
> +
> +  * ``rte_eth_get_wake_addr()``
> +  * add new eth_dev_ops ``get_wake_addr``
> 
>  * **Updated Broadcom bnxt driver.**
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 59beb8aec2..5d19c97a3e 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -4844,6 +4844,29 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
>  }
> 
> +int
> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
> +		uint8_t *data_sz)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +
> +	dev = &rte_eth_devices[port_id];
> +
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
> +
> +	if (queue_id >= dev->data->nb_tx_queues) {

nb_rx_queues

> +		RTE_ETHDEV_LOG(ERR, "Invalid TX queue_id=%u\n", queue_id);

RX not TX

> +		return -EINVAL;
> +	}
> +
> +	return eth_err(port_id,
> +		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
> +			wake_addr, expected, mask, data_sz));
> +}
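
To make the two comments above concrete: since the function indexes
rx_queues[], the bound to check is the RX queue count and the log message
should say RX. A minimal standalone sketch, assuming a simplified stand-in
struct (dev_data_sketch and check_rx_queue_id are hypothetical names, and
fprintf stands in for RTE_ETHDEV_LOG):

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the relevant fields of the ethdev data struct. */
struct dev_data_sketch {
	uint16_t nb_rx_queues;
	uint16_t nb_tx_queues;
};

/* Validate an RX queue ID against the RX queue count, not the TX one. */
static int
check_rx_queue_id(const struct dev_data_sketch *data, uint16_t queue_id)
{
	if (queue_id >= data->nb_rx_queues) {
		fprintf(stderr, "Invalid RX queue_id=%u\n",
			(unsigned int)queue_id);
		return -EINVAL;
	}
	return 0;
}
```

With fewer RX than TX queues configured, the check as posted would let an
out-of-range index through to rx_queues[].
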
> +
>  int
>  rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>  			     struct rte_ether_addr *mc_addr_set,
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 3a31f94367..005faba455 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -4119,6 +4119,32 @@ __rte_experimental
>  int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  	struct rte_eth_burst_mode *mode);
> 
> +/**
> + * Retrieve the wake up address for the receive queue.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Rx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param wake_addr
> + *   The pointer point to the address which is used for monitoring.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + *
> + * @return
> + *   - 0: Success.
> + *   -ENOTSUP: Operation not supported.
> + *   -EINVAL: Invalid parameters.
> + *   -ENODEV: Invalid port ID.
> + */
> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
> +		uint8_t *data_sz);
> +
>  /**
>   * Retrieve device registers and register attributes (number of registers and
>   * register size)
> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
> index 35cc4fb186..76b179de42 100644
> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> @@ -655,6 +655,32 @@ typedef int (*eth_fec_get_t)(struct rte_eth_dev *dev,
>   */
>  typedef int (*eth_fec_set_t)(struct rte_eth_dev *dev, uint32_t fec_capa);
> 
> +/**
> + * @internal
> + * Get address of memory location whose contents will change whenever there is
> + * new data to be received on an RX queue.
> + *
> + * @param rxq
> + *   Ethdev queue pointer.
> + * @param tail_desc_addr
> + *   The pointer point to where the address will be stored.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + * @param data_sz
> + *   Data size for the expected value (can be 1, 2, 4, or 8 bytes)
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success
> + * @retval -EINVAL
> + *   Invalid parameters
> + */
> +typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -801,6 +827,8 @@ struct eth_dev_ops {
>  	/**< Get Forward Error Correction(FEC) mode. */
>  	eth_fec_set_t fec_set;
>  	/**< Set Forward Error Correction(FEC) mode. */
> +	eth_get_wake_addr_t get_wake_addr;
> +	/**< Get next RX queue ring entry address. */
>  };
> 
>  /**
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index f8a0945812..6c2ea5996d 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -232,6 +232,7 @@ EXPERIMENTAL {
>  	rte_eth_fec_get_capability;
>  	rte_eth_fec_get;
>  	rte_eth_fec_set;
> +	rte_eth_get_wake_addr;
>  };
> 
>  INTERNAL {
> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v6 02/10] eal: add power management intrinsics
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 02/10] eal: add power management intrinsics Anatoly Burakov
@ 2020-10-14 17:48           ` Ananyev, Konstantin
  2020-10-15 10:09             ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-14 17:48 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Jan Viktorin, Ruifeng Wang, David Christensen,
	Richardson, Bruce, Hunt, David, jerinjacobk, thomas, McDaniel,
	Timothy, Eads, Gage, Macnamara, Chris



> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> For more details, please refer to Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual, Volume 2.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> ---
> 
> Notes:
>     v6:
>     - Add spinlock-enabled version to allow pthread-wait-like
>       constructs with umwait
>     - Clarify comments
>     - Added experimental tags to intrinsics
>     - Added endianness support
>     v5:
>     - Removed return values
>     - Simplified intrinsics and hardcoded C0.2 state
>     - Added other arch stubs
> 
>  lib/librte_eal/arm/include/meson.build        |   1 +
>  .../arm/include/rte_power_intrinsics.h        |  58 ++++++++
>  .../include/generic/rte_power_intrinsics.h    | 111 +++++++++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/ppc/include/meson.build        |   1 +
>  .../ppc/include/rte_power_intrinsics.h        |  58 ++++++++
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 132 ++++++++++++++++++
>  8 files changed, 363 insertions(+)
>  create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> 
> diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
> index 73b750a18f..c6a9f70d73 100644
> --- a/lib/librte_eal/arm/include/meson.build
> +++ b/lib/librte_eal/arm/include/meson.build
> @@ -20,6 +20,7 @@ arch_headers = files(
>  	'rte_pause_32.h',
>  	'rte_pause_64.h',
>  	'rte_pause.h',
> +	'rte_power_intrinsics.h',
>  	'rte_prefetch_32.h',
>  	'rte_prefetch_64.h',
>  	'rte_prefetch.h',
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..b04ba10c76
> --- /dev/null
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -0,0 +1,58 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_ARM_H_
> +#define _RTE_POWER_INTRINSIC_ARM_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on ARM.
> + */

Here and in other places: please follow the DPDK coding convention
for function definitions, i.e.:
static inline void
rte_power_monitor(...

> +static inline void rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp, const uint8_t data_sz)
> +{
> +	RTE_SET_USED(p);
> +	RTE_SET_USED(expected_value);
> +	RTE_SET_USED(value_mask);
> +	RTE_SET_USED(tsc_timestamp);
> +	RTE_SET_USED(data_sz);
> +}

You can probably put NOP implementations of these rte_power_* functions
into generic/rte_power_intrinsics.h,
so we wouldn't need to duplicate them for every unsupported arch.
The same was done for rte_wait_until_equal_*().
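
Something along these lines is what the suggestion amounts to. It is a
rough standalone sketch only: the guard macro name
RTE_POWER_INTRINSICS_ARCH_DEFINED is hypothetical, (void) casts stand in
for RTE_SET_USED, and the _sync variant is left out because it needs
rte_spinlock_t.

```c
#include <stdint.h>

/*
 * Generic header sketch. An arch header that provides real
 * implementations would define the (hypothetical) guard macro
 * before including this file; every other arch gets the NOPs.
 */
#ifndef RTE_POWER_INTRINSICS_ARCH_DEFINED

static inline void
rte_power_monitor(const volatile void *p, const uint64_t expected_value,
		const uint64_t value_mask, const uint64_t tsc_timestamp,
		const uint8_t data_sz)
{
	/* no-op: monitoring is not supported on this architecture */
	(void)p;
	(void)expected_value;
	(void)value_mask;
	(void)tsc_timestamp;
	(void)data_sz;
}

static inline void
rte_power_pause(const uint64_t tsc_timestamp)
{
	/* no-op: timed pause is not supported on this architecture */
	(void)tsc_timestamp;
}

#endif /* RTE_POWER_INTRINSICS_ARCH_DEFINED */
```

With that in place, the ARM and PPC headers would reduce to a single
include of the generic file.
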

> +
> +/**
> + * This function is not supported on ARM.
> + */
> +static inline void rte_power_monitor_sync(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp, const uint8_t data_sz,
> +		rte_spinlock_t *lck)
> +{
> +	RTE_SET_USED(p);
> +	RTE_SET_USED(expected_value);
> +	RTE_SET_USED(value_mask);
> +	RTE_SET_USED(tsc_timestamp);
> +	RTE_SET_USED(lck);
> +	RTE_SET_USED(data_sz);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +	RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..f9522f2776
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,111 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +#include <rte_compat.h>
> +#include <rte_spinlock.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file define APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, a certain TSC timestamp is reached, or other
> + * reasons cause the CPU to wake up.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.

Is 64B alignment really needed?


> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + * @param data_sz
> + *   Data size (in bytes) that will be used to compare expected value with the
> + *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
> + *   to undefined result.
> + */
> +__rte_experimental
> +static inline void rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp, const uint8_t data_sz);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, a certain TSC timestamp is reached, or other
> + * reasons cause the CPU to wake up.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This call will also lock a spinlock on entering sleep, and release it on
> + * waking up the CPU.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + * @param data_sz
> + *   Data size (in bytes) that will be used to compare expected value with the
> + *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
> + *   to undefined result.
> + * @param lck
> + *   A spinlock that must be locked before entering the function, will be
> + *   unlocked while the CPU is sleeping, and will be locked again once the CPU
> + *   wakes up.
> + */
> +__rte_experimental
> +static inline void rte_power_monitor_sync(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp, const uint8_t data_sz,
> +		rte_spinlock_t *lck);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + */
> +__rte_experimental
> +static inline void rte_power_pause(const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index cd09027958..3a12e87e19 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -60,6 +60,7 @@ generic_headers = files(
>  	'generic/rte_memcpy.h',
>  	'generic/rte_pause.h',
>  	'generic/rte_prefetch.h',
> +	'generic/rte_power_intrinsics.h',
>  	'generic/rte_rwlock.h',
>  	'generic/rte_spinlock.h',
>  	'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
> index ab4bd28092..0873b2aecb 100644
> --- a/lib/librte_eal/ppc/include/meson.build
> +++ b/lib/librte_eal/ppc/include/meson.build
> @@ -10,6 +10,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_pause.h',
> +	'rte_power_intrinsics.h',
>  	'rte_prefetch.h',
>  	'rte_rwlock.h',
>  	'rte_spinlock.h',
> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..3bceefdc3f
> --- /dev/null
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -0,0 +1,58 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_PPC_H_
> +#define _RTE_POWER_INTRINSIC_PPC_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on PPC64.
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp, const uint8_t data_sz)
> +{
> +	RTE_SET_USED(p);
> +	RTE_SET_USED(expected_value);
> +	RTE_SET_USED(value_mask);
> +	RTE_SET_USED(tsc_timestamp);
> +	RTE_SET_USED(data_sz);
> +}
> +
> +/**
> + * This function is not supported on PPC64.
> + */
> +static inline void rte_power_monitor_sync(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp, const uint8_t data_sz,
> +		rte_spinlock_t *lck)
> +{
> +	RTE_SET_USED(p);
> +	RTE_SET_USED(expected_value);
> +	RTE_SET_USED(value_mask);
> +	RTE_SET_USED(tsc_timestamp);
> +	RTE_SET_USED(lck);
> +	RTE_SET_USED(data_sz);
> +}
> +
> +/**
> + * This function is not supported on PPC64.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +	RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_prefetch.h',
> +	'rte_power_intrinsics.h',
>  	'rte_pause.h',
>  	'rte_rtm.h',
>  	'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..9ac8e6eef6
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,132 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_

Why '_64_H'?
My understanding was that these ops are supported in 32-bit mode too.

> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +static inline uint64_t __get_umwait_val(const volatile void *p,
> +		const uint8_t sz)
> +{
> +	switch (sz) {
> +	case 1:

Just as a nit:
case sizeof(type_x):
	return *(const volatile type_x *)p;

> +		return *(const volatile uint8_t *)p;
> +	case 2:
> +		return *(const volatile uint16_t *)p;
> +	case 4:
> +		return *(const volatile uint32_t *)p;
> +	case 8:
> +		return *(const volatile uint64_t *)p;
> +	default:
> +		/* this is an intrinsic, so we can't have any error handling */

RTE_ASSERT(0); ?
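
Both nits above can be combined in a small standalone sketch. This is only
an illustration under assumptions: plain assert() stands in for
RTE_ASSERT() so that it builds outside DPDK, and get_umwait_val() is a
made-up name rather than the helper from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Read a value of the given size from a monitored address.
 * Case labels are derived from the types themselves, so the label and
 * the cast cannot drift apart.
 */
static inline uint64_t
get_umwait_val(const volatile void *p, const uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		/* stand-in for RTE_ASSERT(0): catch bad sizes in debug builds */
		assert(0 && "unsupported data size");
		return 0;
	}
}
```

As far as I know RTE_ASSERT() compiles to nothing unless asserts are
enabled at build time, so the 'return 0' fallback still has to stay.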

> +		return 0;
> +	}
> +}
> +
> +/**
> + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
> + * For more information about usage of these instructions, please refer to
> + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp, const uint8_t data_sz)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +
> +	if (value_mask) {
> +		const uint64_t cur_value = __get_umwait_val(p, data_sz);
> +		const uint64_t masked = cur_value & value_mask;
> +
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
> +			: /* ignore rflags */
> +			: "D"(0), /* enter C0.2 */
> +			  "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +/**
> + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
> + * For more information about usage of these instructions, please refer to
> + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
> + */
> +static inline void rte_power_monitor_sync(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp, const uint8_t data_sz,
> +		rte_spinlock_t *lck)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +
> +	if (value_mask) {
> +		const uint64_t cur_value = __get_umwait_val(p, data_sz);
> +		const uint64_t masked = cur_value & value_mask;
> +
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return;
> +	}
> +	rte_spinlock_unlock(lck);
> +
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
> +			: /* ignore rflags */
> +			: "D"(0), /* enter C0.2 */
> +			  "a"(tsc_l), "d"(tsc_h));
> +
> +	rte_spinlock_lock(lck);
> +}
> +
> +/**
> + * This function uses TPAUSE instruction  and will enter C0.2 state. For more
> + * information about usage of this instruction, please refer to Intel(R) 64 and
> + * IA-32 Architectures Software Developer's Manual.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
> +		: /* ignore rflags */
> +		: "D"(0), /* enter C0.2 */
> +		  "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v6 03/10] eal: add intrinsics support check infrastructure
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
@ 2020-10-14 17:51           ` Ananyev, Konstantin
  2020-10-14 17:59           ` Jerin Jacob
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-14 17:51 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Ray Kinsella,
	Neil Horman, Richardson, Bruce, Hunt, David, Ma, Liang J,
	jerinjacobk, thomas, McDaniel, Timothy, Eads, Gage, Macnamara,
	Chris



> 
> Currently, it is not possible to check support for intrinsics that
> are platform-specific, cannot be abstracted in a generic way, or do not
> have support on all architectures. The CPUID flags can be used to some
> extent, but they are only defined for their platform, while intrinsics
> will be available to all code as they are in generic headers.
> 
> This patch introduces infrastructure to check support for certain
> platform-specific intrinsics, and adds support for checking support for
> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> --
> 2.17.1


* Re: [dpdk-dev] [PATCH v6 03/10] eal: add intrinsics support check infrastructure
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
  2020-10-14 17:51           ` Ananyev, Konstantin
@ 2020-10-14 17:59           ` Jerin Jacob
  1 sibling, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-14 17:59 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dpdk-dev, Jan Viktorin, Ruifeng Wang, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Liang Ma, Thomas Monjalon, McDaniel, Timothy,
	Gage Eads, chris.macnamara

On Wed, Oct 14, 2020 at 7:00 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> Currently, it is not possible to check support for intrinsics that
> are platform-specific, cannot be abstracted in a generic way, or do not
> have support on all architectures. The CPUID flags can be used to some
> extent, but they are only defined for their platform, while intrinsics
> will be available to all code as they are in generic headers.
>
> This patch introduces infrastructure to check support for certain
> platform-specific intrinsics, and adds support for checking support for
> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>


> ---
>
> Notes:
>     v6:
>     - Fix the comments
>
>  lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
>  lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
>  .../include/generic/rte_power_intrinsics.h    | 12 +++++++++
>  lib/librte_eal/ppc/rte_cpuflags.c             |  7 +++++
>  lib/librte_eal/rte_eal_version.map            |  1 +
>  lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
>  6 files changed, 64 insertions(+)
>
> diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
> index 7b257b7873..e3a53bcece 100644
> --- a/lib/librte_eal/arm/rte_cpuflags.c
> +++ b/lib/librte_eal/arm/rte_cpuflags.c
> @@ -151,3 +151,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
> index 872f0ebe3e..28a5aecde8 100644
> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
> @@ -13,6 +13,32 @@
>  #include "rte_common.h"
>  #include <errno.h>
>
> +#include <rte_compat.h>
> +
> +/**
> + * Structure used to describe platform-specific intrinsics that may or may not
> + * be supported at runtime.
> + */
> +struct rte_cpu_intrinsics {
> +       uint32_t power_monitor : 1;
> +       /**< indicates support for rte_power_monitor function */
> +       uint32_t power_pause : 1;
> +       /**< indicates support for rte_power_pause function */
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Check CPU support for various intrinsics at runtime.
> + *
> + * @param intrinsics
> + *     Pointer to a structure to be filled.
> + */
> +__rte_experimental
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
> +
>  /**
>   * Enumeration of all CPU features supported
>   */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> index f9522f2776..1c176e4ef5 100644
> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -32,6 +32,10 @@
>   * checked against the expected value, and if they match, the entering of
>   * optimized power state may be aborted.
>   *
> + * @warning It is the responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
> + *   Failing to do so may result in an illegal CPU instruction error.
> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -69,6 +73,10 @@ static inline void rte_power_monitor(const volatile void *p,
>   * This call will also lock a spinlock on entering sleep, and release it on
>   * waking up the CPU.
>   *
> + * @warning It is the responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
> + *   Failing to do so may result in an illegal CPU instruction error.
> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -101,6 +109,10 @@ static inline void rte_power_monitor_sync(const volatile void *p,
>   * Enter an architecture-defined optimized power state until a certain TSC
>   * timestamp is reached.
>   *
> + * @warning It is the responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
> + *   Failing to do so may result in an illegal CPU instruction error.
> + *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for. Note that the wait behavior is
>   *   architecture-dependent.
> diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
> index 3bb7563ce9..61db5c216d 100644
> --- a/lib/librte_eal/ppc/rte_cpuflags.c
> +++ b/lib/librte_eal/ppc/rte_cpuflags.c
> @@ -8,6 +8,7 @@
>  #include <elf.h>
>  #include <fcntl.h>
>  #include <assert.h>
> +#include <string.h>
>  #include <unistd.h>
>
>  /* Symbolic values for the entries in the auxiliary table */
> @@ -108,3 +109,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index a93dea9fe6..ed944f2bd4 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -400,6 +400,7 @@ EXPERIMENTAL {
>         # added in 20.11
>         __rte_eal_trace_generic_size_t;
>         rte_service_lcore_may_be_active;
> +       rte_cpu_get_intrinsics_support;
>  };
>
>  INTERNAL {
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 0325c4b93b..a96312ff7f 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -7,6 +7,7 @@
>  #include <stdio.h>
>  #include <errno.h>
>  #include <stdint.h>
> +#include <string.h>
>
>  #include "rte_cpuid.h"
>
> @@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +
> +       if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> +               intrinsics->power_monitor = 1;
> +               intrinsics->power_pause = 1;
> +       }
> +}
> --
> 2.17.1
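
For reference, a rough sketch of how a caller might use the support check
before picking a wait method. It is standalone and does not include any
DPDK headers: struct cpu_intrinsics only mirrors the two bits of
struct rte_cpu_intrinsics from the patch, and get_intrinsics_support(),
its has_waitpkg argument, and pick_wait_method() are hypothetical names
used for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* local stand-in mirroring struct rte_cpu_intrinsics from the patch */
struct cpu_intrinsics {
	uint32_t power_monitor : 1;
	uint32_t power_pause : 1;
};

/*
 * Hypothetical probe. has_waitpkg stands in for the result of
 * rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG) on x86; other
 * architectures would always pass 0 here.
 */
static void
get_intrinsics_support(struct cpu_intrinsics *i, int has_waitpkg)
{
	memset(i, 0, sizeof(*i));
	if (has_waitpkg) {
		i->power_monitor = 1;
		i->power_pause = 1;
	}
}

enum wait_method { WAIT_BUSY_POLL, WAIT_PAUSE, WAIT_MONITOR };

/* never pick an intrinsic the CPU did not report as supported */
static enum wait_method
pick_wait_method(const struct cpu_intrinsics *i)
{
	if (i->power_monitor)
		return WAIT_MONITOR;
	if (i->power_pause)
		return WAIT_PAUSE;
	return WAIT_BUSY_POLL;
}
```

The point is that the decision is made once, up front, so the fast path
never executes an instruction that could fault.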


* Re: [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback Anatoly Burakov
  2020-10-14 14:19           ` David Hunt
@ 2020-10-14 18:41           ` Ananyev, Konstantin
  2020-10-15 10:31             ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-14 18:41 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, jerinjacobk,
	Richardson, Bruce, thomas, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris

> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will wait until
> either a TSC timestamp expires or packets arrive.
> 
> This API mandates a core-to-single-queue mapping (that is, multiple
> queues per device are supported, but they have to be polled on different
> cores).
> 
> This design is using PMD RX callbacks.
> 
> 1. UMWAIT/UMONITOR:
> 
>    When a certain threshold of empty polls is reached, the core will go
>    into a power optimized sleep while waiting on an address of next RX
>    descriptor to be written to.
> 
> 2. Pause instruction
> 
>    Instead of moving the core into a deeper C-state, this method uses the
>    pause instruction to avoid busy polling.
> 
> 3. Frequency scaling
>    Reuse existing DPDK power library to scale up/down core frequency
>    depending on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v6:
>     - Added wakeup mechanism for UMWAIT
>     - Removed memory allocation (everything is now allocated statically)
>     - Fixed various typos and comments
>     - Check for invalid queue ID
>     - Moved release notes to this patch
> 
>     v5:
>     - Make error checking more robust
>       - Prevent initializing scaling if ACPI or PSTATE env wasn't set
>       - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
>     - Add some debug logging
>     - Replace x86-specific code path to generic path using the intrinsic check
> 
>  doc/guides/rel_notes/release_20_11.rst |  11 +
>  lib/librte_power/meson.build           |   5 +-
>  lib/librte_power/rte_power_pmd_mgmt.c  | 300 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_pmd_mgmt.h  |  92 ++++++++
>  lib/librte_power/rte_power_version.map |   4 +
>  5 files changed, 410 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> 
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index ca4f43f7f9..06b822aa36 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -197,6 +197,17 @@ New Features
>    * Added new ``RTE_ACL_CLASSIFY_AVX512X32`` vector implementation,
>      which can process up to 32 flows in parallel. Requires AVX512 support.
> 
> +* **Add PMD power management mechanism**
> +
> +  Three new Ethernet PMD power management mechanisms are added through the
> +  existing RX callback infrastructure.
> +
> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
> +  * Add power saving scheme based on ``rte_pause()``
> +  * Add power saving scheme based on frequency scaling through the power library
> +  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
> +  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
> +
> 
>  Removed Items
>  -------------
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 78c031c943..cc3c7a8646 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>  		'power_kvm_vm.c', 'guest_channel.c',
>  		'rte_power_empty_poll.c',
>  		'power_pstate_cpufreq.c',
> +		'rte_power_pmd_mgmt.c',
>  		'power_common.c')
> -headers = files('rte_power.h','rte_power_empty_poll.h')
> -deps += ['timer']
> +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
> +deps += ['timer' ,'ethdev']
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> new file mode 100644
> index 0000000000..2b7d2a1a46
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> @@ -0,0 +1,300 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_cpuflags.h>
> +#include <rte_malloc.h>
> +#include <rte_ethdev.h>
> +#include <rte_power_intrinsics.h>
> +
> +#include "rte_power_pmd_mgmt.h"
> +
> +#define EMPTYPOLL_MAX  512
> +
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum pmd_mgmt_state {
> +	/** Device power management is disabled. */
> +	PMD_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	PMD_MGMT_ENABLED,
> +};
> +
> +struct pmd_queue_cfg {
> +	enum pmd_mgmt_state pwr_mgmt_state;
> +	/**< State of power management for this queue */
> +	enum rte_power_pmd_mgmt_type cb_mode;
> +	/**< Callback mode for this queue */
> +	const struct rte_eth_rxtx_callback *cur_cb;
> +	/**< Callback instance */
> +	rte_spinlock_t umwait_lock;
> +	/**< Per-queue status lock - used only for UMWAIT mode */
> +	volatile void *wait_addr;
> +	/**< UMWAIT wakeup address */
> +	uint64_t empty_poll_stats;
> +	/**< Number of empty polls */
> +} __rte_cache_aligned;
> +
> +static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
> +
> +/* trigger a write to the cache line we're waiting on */
> +static void umwait_wakeup(volatile void *addr)
> +{
> +	uint64_t val;
> +
> +	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
> +	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
> +			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
> +}
> +
> +static uint16_t
> +clb_umwait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
> +{
> +
> +	struct pmd_queue_cfg *q_conf;
> +
> +	q_conf = &port_cfg[port_id][qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			volatile void *target_addr;
> +			uint64_t expected, mask;
> +			uint16_t ret;
> +			uint8_t data_sz;
> +
> +			/*
> +			 * get address of next descriptor in the RX
> +			 * ring for this queue, as well as expected
> +			 * value and a mask.
> +			 */
> +			ret = rte_eth_get_wake_addr(port_id, qidx,
> +					&target_addr, &expected, &mask,
> +					&data_sz);
> +			if (ret == 0) {
> +				/*
> +				 * we need to ensure we can wake up by another
> +				 * thread triggering a write, so we need the
> +				 * address to always be up to date.
> +				 */
> +				rte_spinlock_lock(&q_conf->umwait_lock);


I think you need to check the state here, and _disable() has to set the state with the lock held.
Otherwise this lock won't protect you from race conditions.
As an example:

CP@T0:
	rte_spinlock_lock(&queue_cfg->umwait_lock);
	if (queue_cfg->wait_addr != NULL) //wait_addr == NULL, fallthrough
	rte_spinlock_unlock(&queue_cfg->umwait_lock);

DP@T1:
	rte_spinlock_lock(&queue_cfg->umwait_lock);
	queue_cfg->wait_addr = target_addr;
	monitor_sync(...);  // DP was put to sleep

CP@T2:
	queue_cfg->cur_cb = NULL;
	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
	ret = 0;

rte_power_pmd_mgmt_queue_disable() finished with success,
but the DP core wasn't woken up.

To be more specific:
clb_umwait(...) {
	...
	lock(&qcfg->lck);
	if (qcfg->state == ENABLED)  {
		qcfg->wait_addr = addr;
		monitor_sync(addr, ..., &qcfg->lck);
	}
	unlock(&qcfg->lck);
	...
}

_disable(...) {
	...
	lock(&qcfg->lck);
	qcfg->state = DISABLED;
	if (qcfg->wait_addr != NULL)
		monitor_wakeup(qcfg->wait_addr);
	unlock(&qcfg->lck);
	...
}
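
The scheme above can also be modelled as a standalone program. This is
only a sketch under loud assumptions: a pthread mutex stands in for the
spinlock, a condition variable stands in for a write to the monitored
cache line (pthread_cond_wait() drops the lock while asleep and retakes it
on wakeup, which is how monitor_sync() is described), and every name here
(q_cfg, dp_try_sleep, cp_disable, ...) is made up for illustration. Unlike
real UMWAIT, the model only wakes up on disable, not on packet arrival or
timeout.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

enum q_state { Q_DISABLED = 0, Q_ENABLED };

struct q_cfg {
	pthread_mutex_t lck;   /* stand-in for umwait_lock */
	pthread_cond_t wake;   /* stand-in for the monitored address */
	enum q_state state;
	bool sleeping;         /* stand-in for wait_addr != NULL */
	bool wake_requested;
};

static void q_cfg_init(struct q_cfg *q)
{
	pthread_mutex_init(&q->lck, NULL);
	pthread_cond_init(&q->wake, NULL);
	q->state = Q_ENABLED;
	q->sleeping = false;
	q->wake_requested = false;
}

/* data plane: sleep only if the state is still ENABLED under the lock */
static bool dp_try_sleep(struct q_cfg *q)
{
	bool slept = false;

	pthread_mutex_lock(&q->lck);
	if (q->state == Q_ENABLED) {
		q->sleeping = true;
		/* the lock is released while waiting, retaken on wakeup */
		while (!q->wake_requested)
			pthread_cond_wait(&q->wake, &q->lck);
		q->wake_requested = false;
		q->sleeping = false;
		slept = true;
	}
	pthread_mutex_unlock(&q->lck);
	return slept;
}

/* control plane: change the state and wake the sleeper under the same lock */
static void cp_disable(struct q_cfg *q)
{
	pthread_mutex_lock(&q->lck);
	q->state = Q_DISABLED;
	if (q->sleeping) {
		q->wake_requested = true;
		pthread_cond_signal(&q->wake);
	}
	pthread_mutex_unlock(&q->lck);
}

/* thread entry point wrapping the data-plane side */
static void *dp_thread(void *arg)
{
	dp_try_sleep(arg);
	return NULL;
}
```

Whichever side takes the lock first, the data-plane thread cannot end up
asleep after cp_disable() has returned: either it sees DISABLED and never
sleeps, or it is already marked as sleeping and gets woken.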

> +				q_conf->wait_addr = target_addr;
> +				/* -1ULL is maximum value for TSC */
> +				rte_power_monitor_sync(target_addr, expected,
> +						mask, -1ULL, data_sz,
> +						&q_conf->umwait_lock);
> +				/* erase the address */
> +				q_conf->wait_addr = NULL;
> +				rte_spinlock_unlock(&q_conf->umwait_lock);
> +			}
> +		}
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +clb_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
> +{
> +	struct pmd_queue_cfg *q_conf;
> +
> +	q_conf = &port_cfg[port_id][qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		/* sleep for 1 microsecond */
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
> +			rte_delay_us(1);
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +clb_scale_freq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct pmd_queue_cfg *q_conf;
> +
> +	q_conf = &port_cfg[port_id][qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
> +			/* scale down freq */
> +			rte_power_freq_min(rte_lcore_id());
> +	} else {
> +		q_conf->empty_poll_stats = 0;
> +		/* scale up freq */
> +		rte_power_freq_max(rte_lcore_id());
> +	}
> +
> +	return nb_rx;
> +}
> +
> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id,
> +		enum rte_power_pmd_mgmt_type mode)
> +{
> +	struct rte_eth_dev *dev;
> +	struct pmd_queue_cfg *queue_cfg;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/* check if queue id is valid */
> +	if (queue_id >= dev->data->nb_rx_queues ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		return -EINVAL;
> +	}
> +
> +	queue_cfg = &port_cfg[port_id][queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> +		ret = -EINVAL;
> +		goto end;
> +	}
> +
> +	switch (mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:
> +	{
> +		/* check if rte_power_monitor is supported */
> +		uint64_t dummy_expected, dummy_mask;
> +		struct rte_cpu_intrinsics i;
> +		volatile void *dummy_addr;
> +		uint8_t dummy_sz;
> +
> +		rte_cpu_get_intrinsics_support(&i);
> +
> +		if (!i.power_monitor) {
> +			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +
> +		/* check if the device supports the necessary PMD API */
> +		if (rte_eth_get_wake_addr(port_id, queue_id,
> +				&dummy_addr, &dummy_expected,
> +				&dummy_mask, &dummy_sz) == -ENOTSUP) {
> +			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* initialize UMWAIT spinlock */
> +		rte_spinlock_init(&queue_cfg->umwait_lock);

I don't think you need to do that.
It is supposed to be in a valid state already (otherwise you are probably in trouble anyway).

> +
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +				clb_umwait, NULL);

It would be a bit cleaner to move add_rx_callback() out of the switch () {},
as you always have to do it anyway.
Same thought for disable() and remove_rx_callback().
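
A rough standalone sketch of that restructuring, assuming simplified
stand-ins: the callback type, the enum, struct queue_cfg and
queue_enable() are illustrative names, and the plain pointer assignment
takes the place of the real rte_eth_add_rx_callback() call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum mgmt_type { MGMT_WAIT = 1, MGMT_PAUSE, MGMT_SCALE };

/* simplified callback signature, for illustration only */
typedef uint16_t (*rx_cb_fn)(uint16_t nb_rx);

static uint16_t clb_umwait(uint16_t nb_rx) { return nb_rx; }
static uint16_t clb_pause(uint16_t nb_rx) { return nb_rx; }
static uint16_t clb_scale_freq(uint16_t nb_rx) { return nb_rx; }

struct queue_cfg {
	int enabled;
	enum mgmt_type mode;
	uint64_t empty_polls;
	rx_cb_fn cur_cb;
};

static int
queue_enable(struct queue_cfg *q, enum mgmt_type mode)
{
	rx_cb_fn cb;

	if (q->enabled)
		return -EINVAL;

	/* the switch only validates the mode and selects a callback */
	switch (mode) {
	case MGMT_WAIT:
		/* mode-specific capability checks would go here */
		cb = clb_umwait;
		break;
	case MGMT_PAUSE:
		cb = clb_pause;
		break;
	case MGMT_SCALE:
		/* power library init and env checks would go here */
		cb = clb_scale_freq;
		break;
	default:
		return -EINVAL;
	}

	/* common tail: initialize data, then register exactly once */
	q->empty_polls = 0;
	q->mode = mode;
	q->enabled = 1;
	q->cur_cb = cb; /* stand-in for rte_eth_add_rx_callback() */
	return 0;
}
```

A side effect of the common tail is that an unknown mode now fails
explicitly instead of falling out of the switch with ret = 0.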

> +		break;
> +	}
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +	{
> +		enum power_management_env env;
> +		/* only PSTATE and ACPI modes are supported */
> +		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> +				!rte_power_check_env_supported(
> +					PM_ENV_PSTATE_CPUFREQ)) {
> +			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* ensure we could initialize the power library */
> +		if (rte_power_init(lcore_id)) {
> +			ret = -EINVAL;
> +			goto end;
> +		}
> +		/* ensure we initialized the correct env */
> +		env = rte_power_get_env();
> +		if (env != PM_ENV_ACPI_CPUFREQ &&
> +				env != PM_ENV_PSTATE_CPUFREQ) {
> +			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
> +				queue_id, clb_scale_freq, NULL);
> +		break;
> +	}
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +				clb_pause, NULL);
> +		break;
> +	}
> +	ret = 0;
> +
> +end:
> +	return ret;
> +}
> +
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id)
> +{
> +	struct pmd_queue_cfg *queue_cfg;
> +	int ret;
> +
> +	queue_cfg = &port_cfg[port_id][queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
> +		ret = -EINVAL;
> +		goto end;
> +	}
> +
> +	switch (queue_cfg->cb_mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:
> +		rte_spinlock_lock(&queue_cfg->umwait_lock);
> +
> +		/* wake up the core from UMWAIT sleep, if any */
> +		if (queue_cfg->wait_addr != NULL)
> +			umwait_wakeup(queue_cfg->wait_addr);
> +
> +		rte_spinlock_unlock(&queue_cfg->umwait_lock);
> +		/* fall-through */
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +				queue_cfg->cur_cb);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +		rte_power_freq_max(lcore_id);
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +				queue_cfg->cur_cb);
> +		rte_power_exit(lcore_id);
> +		break;
> +	}
> +	/*
> +	 * we don't free the RX callback here because it is unsafe to do so
> +	 * unless we know for a fact that all data plane threads have stopped.
> +	 */
> +	queue_cfg->cur_cb = NULL;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +	ret = 0;
> +end:
> +	return ret;
> +}
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
> new file mode 100644
> index 0000000000..a7a3f98268
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.h
> @@ -0,0 +1,92 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_PMD_MGMT_H
> +#define _RTE_POWER_PMD_MGMT_H
> +
> +/**
> + * @file
> + * RTE PMD Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_power.h>
> +#include <rte_atomic.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * PMD Power Management Type
> + */
> +enum rte_power_pmd_mgmt_type {
> +	/** WAIT callback mode. */
> +	RTE_POWER_MGMT_TYPE_WAIT = 1,
> +	/** PAUSE callback mode. */
> +	RTE_POWER_MGMT_TYPE_PAUSE,
> +	/** Freq Scaling callback mode. */
> +	RTE_POWER_MGMT_TYPE_SCALE,
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Setup per-queue power management callback.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @param mode
> + *   The power management callback function type.
> +
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +				uint16_t port_id,
> +				uint16_t queue_id,
> +				enum rte_power_pmd_mgmt_type mode);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Remove per-queue power management callback.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +				uint16_t port_id,
> +				uint16_t queue_id);
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index 69ca9af616..3f2f6cd6f6 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -34,4 +34,8 @@ EXPERIMENTAL {
>  	rte_power_guest_channel_receive_msg;
>  	rte_power_poll_stat_fetch;
>  	rte_power_poll_stat_update;
> +	# added in 20.11
> +	rte_power_pmd_mgmt_queue_enable;
> +	rte_power_pmd_mgmt_queue_disable;
> +
>  };
> --
> 2.17.1
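
For readers skimming the thread, the empty-poll heuristic that all three
callbacks share can be sketched in isolation. This is a simplified model,
not the patch code: struct poll_state and should_sleep() are illustrative
names, and only the EMPTYPOLL_MAX value of 512 is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512

struct poll_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Called after every RX burst. Returns true once more than EMPTYPOLL_MAX
 * consecutive polls came back empty, i.e. when the caller should apply
 * its power-saving action (monitor/wait, pause, or frequency scale-down).
 * Any non-empty burst resets the counter.
 */
static bool
should_sleep(struct poll_state *s, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		s->empty_polls = 0;
		return false;
	}
	s->empty_polls++;
	return s->empty_polls > EMPTYPOLL_MAX;
}
```

Note that once the threshold is crossed, every further empty poll keeps
returning true until traffic shows up again.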


* Re: [dpdk-dev] [PATCH v6 02/10] eal: add power management intrinsics
  2020-10-14 17:48           ` Ananyev, Konstantin
@ 2020-10-15 10:09             ` Burakov, Anatoly
  2020-10-15 10:45               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-15 10:09 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Ma, Liang J, Jan Viktorin, Ruifeng Wang, David Christensen,
	Richardson, Bruce, Hunt, David, jerinjacobk, thomas, McDaniel,
	Timothy, Eads, Gage, Macnamara, Chris

On 14-Oct-20 6:48 PM, Ananyev, Konstantin wrote:
> 
> 
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
>>
>> For more details, please refer to Intel(R) 64 and IA-32 Architectures
>> Software Developer's Manual, Volume 2.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
>> ---
>>
>> Notes:
>>      v6:
>>      - Add spinlock-enabled version to allow pthread-wait-like
>>        constructs with umwait
>>      - Clarify comments
>>      - Added experimental tags to intrinsics
>>      - Added endianness support
>>      v5:
>>      - Removed return values
>>      - Simplified intrinsics and hardcoded C0.2 state
>>      - Added other arch stubs
>>
>>   lib/librte_eal/arm/include/meson.build        |   1 +
>>   .../arm/include/rte_power_intrinsics.h        |  58 ++++++++
>>   .../include/generic/rte_power_intrinsics.h    | 111 +++++++++++++++
>>   lib/librte_eal/include/meson.build            |   1 +
>>   lib/librte_eal/ppc/include/meson.build        |   1 +
>>   .../ppc/include/rte_power_intrinsics.h        |  58 ++++++++
>>   lib/librte_eal/x86/include/meson.build        |   1 +
>>   .../x86/include/rte_power_intrinsics.h        | 132 ++++++++++++++++++
>>   8 files changed, 363 insertions(+)
>>   create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
>>   create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>>   create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
>>   create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
>>
>> diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
>> index 73b750a18f..c6a9f70d73 100644
>> --- a/lib/librte_eal/arm/include/meson.build
>> +++ b/lib/librte_eal/arm/include/meson.build
>> @@ -20,6 +20,7 @@ arch_headers = files(
>>   'rte_pause_32.h',
>>   'rte_pause_64.h',
>>   'rte_pause.h',
>> +'rte_power_intrinsics.h',
>>   'rte_prefetch_32.h',
>>   'rte_prefetch_64.h',
>>   'rte_prefetch.h',
>> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> new file mode 100644
>> index 0000000000..b04ba10c76
>> --- /dev/null
>> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> @@ -0,0 +1,58 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2020 Intel Corporation
>> + */
>> +
>> +#ifndef _RTE_POWER_INTRINSIC_ARM_H_
>> +#define _RTE_POWER_INTRINSIC_ARM_H_
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <rte_common.h>
>> +
>> +#include "generic/rte_power_intrinsics.h"
>> +
>> +/**
>> + * This function is not supported on ARM.
>> + */
> 
> Here and in other places - please follow dpdk coding convention
> for function definitions, i.e:
> static inline void
> rte_power_monitor(...
> 

Yep, will do.

>> +static inline void rte_power_monitor(const volatile void *p,
>> +const uint64_t expected_value, const uint64_t value_mask,
>> +const uint64_t tsc_timestamp, const uint8_t data_sz)
>> +{
>> +RTE_SET_USED(p);
>> +RTE_SET_USED(expected_value);
>> +RTE_SET_USED(value_mask);
>> +RTE_SET_USED(tsc_timestamp);
>> +RTE_SET_USED(data_sz);
>> +}
> 
> You can probably put NOP implementations of these rte_power_* functions
> into generic/rte_power_intrinsics.h.
> So, you wouldn't need to duplicate them for every non-supported arch.
> Same as it was done for rte_wait_until_equal_*().
> 

Will look into it.

>> + *
>> + * This file defines APIs for advanced power management,
>> + * which are architecture-dependent.
>> + */
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Monitor specific address for changes. This will cause the CPU to enter an
>> + * architecture-defined optimized power state until either the specified
>> + * memory address is written to, a certain TSC timestamp is reached, or other
>> + * reasons cause the CPU to wake up.
>> + *
>> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
>> + * mask is non-zero, the current value pointed to by the `p` pointer will be
>> + * checked against the expected value, and if they match, the entering of
>> + * optimized power state may be aborted.
>> + *
>> + * @param p
>> + *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
> 
> Is 64B alignment really needed?

I'm not 100% sure to be honest, but it's there just in case. I can 
remove it.

>>   'rte_prefetch.h',
>> +'rte_power_intrinsics.h',
>>   'rte_pause.h',
>>   'rte_rtm.h',
>>   'rte_rwlock.h',
>> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
>> new file mode 100644
>> index 0000000000..9ac8e6eef6
>> --- /dev/null
>> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
>> @@ -0,0 +1,132 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2020 Intel Corporation
>> + */
>> +
>> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
>> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> 
> Why '_64_H'?
> My understanding was that these ops are supported in 32-bit mode too.

Yeah, artifact of early implementation. Will fix.

> 
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <rte_common.h>
>> +
>> +#include "generic/rte_power_intrinsics.h"
>> +
>> +static inline uint64_t __get_umwait_val(const volatile void *p,
>> +const uint8_t sz)
>> +{
>> +switch (sz) {
>> +case 1:
> 
> Just as a nit:
> case sizeof(type_x):
> return *(const volatile type_x *)p;

Thanks, will fix.

> 
>> +return *(const volatile uint8_t *)p;
>> +case 2:
>> +return *(const volatile uint16_t *)p;
>> +case 4:
>> +return *(const volatile uint32_t *)p;
>> +case 8:
>> +return *(const volatile uint64_t *)p;
>> +default:
>> +/* this is an intrinsic, so we can't have any error handling */
> 
> RTE_ASSERT(0); ?

Great idea, will add.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback
  2020-10-14 18:41           ` Ananyev, Konstantin
@ 2020-10-15 10:31             ` Burakov, Anatoly
  2020-10-15 16:02               ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-15 10:31 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, jerinjacobk,
	Richardson, Bruce, thomas, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris

On 14-Oct-20 7:41 PM, Ananyev, Konstantin wrote:
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add a simple on/off switch that will enable saving power when no
>> packets are arriving. It is based on counting the number of empty
>> polls and, when the number reaches a certain threshold, entering an
>> architecture-defined optimized power state that will either wait
>> until a TSC timestamp expires, or until packets arrive.
>>
>> This API mandates a core-to-single-queue mapping (that is, multiple
>> queues per device are supported, but they have to be polled on different
>> cores).
>>
>> This design is using PMD RX callbacks.
>>
>> 1. UMWAIT/UMONITOR:
>>
>>     When a certain threshold of empty polls is reached, the core will go
>>     into a power optimized sleep while waiting on an address of next RX
>>     descriptor to be written to.
>>
>> 2. Pause instruction
>>
>>     Instead of moving the core into a deeper C state, this method uses the
>>     pause instruction to avoid busy polling.
>>
>> 3. Frequency scaling
>>     Reuse existing DPDK power library to scale up/down core frequency
>>     depending on traffic volume.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v6:
>>      - Added wakeup mechanism for UMWAIT
>>      - Removed memory allocation (everything is now allocated statically)
>>      - Fixed various typos and comments
>>      - Check for invalid queue ID
>>      - Moved release notes to this patch
>>
>>      v5:
>>      - Make error checking more robust
>>        - Prevent initializing scaling if ACPI or PSTATE env wasn't set
>>        - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
>>      - Add some debug logging
>>      - Replace x86-specific code path to generic path using the intrinsic check
>>


<snip>

> 
> 
> I think you need to check the state here, and _disable() has to set the state with the lock held.
> Otherwise this lock wouldn't protect you from race conditions.
> As an example:
> 
> CP@T0:
> rte_spinlock_lock(&queue_cfg->umwait_lock);
> if (queue_cfg->wait_addr != NULL) //wait_addr == NULL, fallthrough
> rte_spinlock_unlock(&queue_cfg->umwait_lock);
> 
> DP@T1:
> rte_spinlock_lock(&queue_cfg->umwait_lock);
> queue_cfg->wait_addr = target_addr;
> monitor_sync(...);  // DP was put to sleep
> 
> CP@T2:
> queue_cfg->cur_cb = NULL;
> queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> ret = 0;
> 
> rte_power_pmd_mgmt_queue_disable() finished with success,
> but the DP core wasn't woken up.
> 
> To be more specific:
> clb_umwait(...) {
> ...
> lock(&qcfg->lck);
> if (qcfg->state == ENABLED)  {
> qcfg->wake_addr = addr;
> monitor_sync(addr, ...,&qcfg->lck);
> }
> unlock(&qcfg->lck);
> ...
> }
> 
> _disable(...) {
> ...
> lock(&qcfg->lck);
> qcfg->state = DISABLED;
> if (qcfg->wake_addr != NULL)
> monitor_wakeup(qcfg->wake_addr);
> unlock(&qcfg->lck);
> ...
> }

True, didn't think of that. Will fix.
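
For reference, the pattern suggested above can be sketched outside of DPDK. This is only an illustration under stated assumptions: a pthread mutex and condition variable stand in for the spinlock and for monitor_sync()/monitor_wakeup(), all demo_* names are hypothetical, and unlike the real callback the sleeper here wakes only on disable (no timeout, no packet arrival). What it tries to show is that both the state check and the state change happen with the lock held, so disable cannot miss a sleeper.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_state { DEMO_DISABLED, DEMO_ENABLED };

struct demo_qcfg {
	pthread_mutex_t lck;   /* stand-in for umwait_lock */
	pthread_cond_t wake;   /* stand-in for a write to the monitored address */
	enum demo_state state;
	const void *wake_addr; /* non-NULL only while the data path sleeps */
};

/* Data path: sleep only if still enabled. Returns true if it went to sleep. */
static bool
demo_clb_wait(struct demo_qcfg *q, const void *addr)
{
	bool slept = false;

	assert(addr != NULL);
	pthread_mutex_lock(&q->lck);
	if (q->state == DEMO_ENABLED) {
		q->wake_addr = addr;
		/*
		 * Like monitor_sync(): lck is released while asleep and
		 * re-acquired on wakeup. The loop absorbs spurious wakeups.
		 */
		while (q->state == DEMO_ENABLED)
			pthread_cond_wait(&q->wake, &q->lck);
		q->wake_addr = NULL;
		slept = true;
	}
	pthread_mutex_unlock(&q->lck);
	return slept;
}

/* Control path: change state and wake the sleeper under the same lock. */
static void
demo_disable(struct demo_qcfg *q)
{
	pthread_mutex_lock(&q->lck);
	q->state = DEMO_DISABLED;
	if (q->wake_addr != NULL)
		pthread_cond_signal(&q->wake); /* stand-in for monitor_wakeup() */
	pthread_mutex_unlock(&q->lck);
}
```

With this ordering, a disable that runs first makes the callback skip the sleep, and a disable that runs while the data path is asleep signals it before returning.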

>> +
>> +if (!i.power_monitor) {
>> +RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> +ret = -ENOTSUP;
>> +goto end;
>> +}
>> +
>> +/* check if the device supports the necessary PMD API */
>> +if (rte_eth_get_wake_addr(port_id, queue_id,
>> +&dummy_addr, &dummy_expected,
>> +&dummy_mask, &dummy_sz) == -ENOTSUP) {
>> +RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
>> +ret = -ENOTSUP;
>> +goto end;
>> +}
>> +/* initialize UMWAIT spinlock */
>> +rte_spinlock_init(&queue_cfg->umwait_lock);
> 
> I think you don't need to do that.
> It's supposed to be in a valid state (otherwise you are probably in trouble anyway).

This is mostly for initialization, for when we first run the callback. I 
suppose we could do it in an RTE_INIT() function or just leave it be 
since the spinlocks are part of a statically allocated structure and 
will default to 0 anyway (although it wouldn't be proper usage of the 
API as that would be relying on implementation detail).
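
To illustrate the third option (a static initializer instead of either a runtime init call or relying on zero-fill), here is a minimal sketch. It assumes a stand-in lock built on C11 atomic_flag rather than rte_spinlock_t, and every demo_* name is hypothetical. As far as I know, rte_spinlock.h offers RTE_SPINLOCK_INITIALIZER for the same purpose.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* hypothetical stand-in for rte_spinlock_t */
typedef struct {
	atomic_flag locked;
} demo_lock_t;

/* hypothetical stand-in for a static spinlock initializer */
#define DEMO_LOCK_INITIALIZER { ATOMIC_FLAG_INIT }

struct demo_queue_cfg {
	demo_lock_t lock;
	unsigned int empty_poll_stats;
};

/*
 * Statically allocated and explicitly initialized: the lock starts out
 * unlocked by construction, without depending on what an all-zero
 * representation happens to mean for the lock type.
 */
static struct demo_queue_cfg demo_cfg = {
	.lock = DEMO_LOCK_INITIALIZER,
	.empty_poll_stats = 0,
};

/* Try to take the lock once; returns true on success. */
static bool
demo_trylock(demo_lock_t *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked,
			memory_order_acquire);
}

/* Release the lock. */
static void
demo_unlock(demo_lock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

This keeps the "no allocation, no init call" property of the v6 code while still going through the lock type's documented initializer.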

> 
>> +
>> +/* initialize data before enabling the callback */
>> +queue_cfg->empty_poll_stats = 0;
>> +queue_cfg->cb_mode = mode;
>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +
>> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +clb_umwait, NULL);
> 
> Would be a bit cleaner/nicer to move add_rx_callback out of switch() {}
> As you have to do it always anyway.
> Same thought for disable() and remove_rx_callback().

The functions are different for each, so we can't move them out of 
switch (unless you're suggesting to unify the callback to handle all 
three modes?).
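
If unifying is what is meant, a single callback would look roughly like the sketch below. This is not code from the patch: the demo_* names, the threshold value and the simplified signature are all assumptions, and the function only reports which action a real callback would take instead of actually sleeping, pausing or scaling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EMPTYPOLL_MAX 512 /* hypothetical empty-poll threshold */

enum demo_mode { DEMO_MODE_MONITOR, DEMO_MODE_PAUSE, DEMO_MODE_SCALE };

enum demo_action {
	DEMO_ACT_NONE,       /* keep polling as usual */
	DEMO_ACT_SLEEP,      /* would arm monitoring and sleep */
	DEMO_ACT_PAUSE,      /* would execute pause */
	DEMO_ACT_SCALE_DOWN, /* would lower core frequency */
};

struct demo_queue_cfg {
	enum demo_mode cb_mode;
	uint64_t empty_poll_stats;
};

/* One callback body for all three modes, dispatching on cb_mode. */
static enum demo_action
demo_unified_cb(struct demo_queue_cfg *q, uint16_t nb_rx)
{
	assert(q != NULL);

	/* any received packet resets the empty poll counter */
	if (nb_rx != 0) {
		q->empty_poll_stats = 0;
		return DEMO_ACT_NONE;
	}
	/* below the threshold nothing changes */
	if (++q->empty_poll_stats <= DEMO_EMPTYPOLL_MAX)
		return DEMO_ACT_NONE;

	switch (q->cb_mode) {
	case DEMO_MODE_MONITOR:
		return DEMO_ACT_SLEEP;
	case DEMO_MODE_PAUSE:
		return DEMO_ACT_PAUSE;
	case DEMO_MODE_SCALE:
		return DEMO_ACT_SCALE_DOWN;
	}
	return DEMO_ACT_NONE;
}
```

The cost is one extra branch on the mode once the threshold is crossed; the gain is that enable/disable can add and remove the same callback regardless of mode.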

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v6 02/10] eal: add power management intrinsics
  2020-10-15 10:09             ` Burakov, Anatoly
@ 2020-10-15 10:45               ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-15 10:45 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Ma, Liang J, Jan Viktorin, Ruifeng Wang, David Christensen,
	Richardson, Bruce, Hunt, David, jerinjacobk, thomas, McDaniel,
	Timothy, Eads, Gage, Macnamara, Chris

>>> +static inline void rte_power_monitor(const volatile void *p,
>>> +const uint64_t expected_value, const uint64_t value_mask,
>>> +const uint64_t tsc_timestamp, const uint8_t data_sz)
>>> +{
>>> +RTE_SET_USED(p);
>>> +RTE_SET_USED(expected_value);
>>> +RTE_SET_USED(value_mask);
>>> +RTE_SET_USED(tsc_timestamp);
>>> +RTE_SET_USED(data_sz);
>>> +}
>>
>> You can probably put NOP implementations of these rte_power_* functions
>> into generic/rte_power_intrinsics.h.
>> So, you wouldn't need to duplicate them for every non-supported arch.
>> Same as it was done for rte_wait_until_equal_*().
>>
> 
> Will look into it.
> 
To be completely honest, I don't like that approach. The ifdefery in 
generic headers looks ugly and out of place, I'd rather leave everything 
in arch specific header files and provide stubs there.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v6 04/10] ethdev: add simple power management API
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 04/10] ethdev: add simple power management API Anatoly Burakov
  2020-10-14 17:06           ` Ananyev, Konstantin
@ 2020-10-15 11:29           ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-10-15 11:29 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, timothy.mcdaniel, gage.eads,
	chris.macnamara

On 14 Oct 14:30, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple API to allow getting address of next RX descriptor from the
> PMD, as well as release notes information.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
<snip>
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 3a31f94367..005faba455 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -4119,6 +4119,32 @@ __rte_experimental
>  int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  	struct rte_eth_burst_mode *mode);
>  
> +/**
> + * Retrieve the wake up address for the receive queue.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Rx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param wake_addr
> + *   Pointer to the address that is used for monitoring.
> + * @param expected
> + *   Pointer to the value expected when the descriptor is set.
> + * @param mask
> + *   Pointer to the comparison bitmask for the expected value.
> + *
Need to add a comment for data_sz.
> + * @return
> + *   - 0: Success.
> + *   -ENOTSUP: Operation not supported.
> + *   -EINVAL: Invalid parameters.
> + *   -ENODEV: Invalid port ID.
> + */
> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
> +		uint8_t *data_sz);
> +
>  /**
>   * Retrieve device registers and register attributes (number of registers and
>   * register size)
> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
> index 35cc4fb186..76b179de42 100644
> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> @@ -655,6 +655,32 @@ typedef int (*eth_fec_get_t)(struct rte_eth_dev *dev,
>   */
>  typedef int (*eth_fec_set_t)(struct rte_eth_dev *dev, uint32_t fec_capa);
>  
> +/**
> + * @internal
> + * Get address of memory location whose contents will change whenever there is
> + * new data to be received on an RX queue.
> + *
> + * @param rxq
> + *   Ethdev queue pointer.
> + * @param tail_desc_addr
> + *   Pointer to where the address will be stored.
> + * @param expected
> + *   Pointer to the value expected when the descriptor is set.
> + * @param mask
> + *   Pointer to the comparison bitmask for the expected value.
> + * @param data_sz
> + *   Data size for the expected value (can be 1, 2, 4, or 8 bytes)
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success
> + * @retval -EINVAL
> + *   Invalid parameters
> + */
> +typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -801,6 +827,8 @@ struct eth_dev_ops {
>  	/**< Get Forward Error Correction(FEC) mode. */
>  	eth_fec_set_t fec_set;
>  	/**< Set Forward Error Correction(FEC) mode. */
> +	eth_get_wake_addr_t get_wake_addr;
> +	/**< Get next RX queue ring entry address. */
>  };
>  
>  /**
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index f8a0945812..6c2ea5996d 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -232,6 +232,7 @@ EXPERIMENTAL {
>  	rte_eth_fec_get_capability;
>  	rte_eth_fec_get;
>  	rte_eth_fec_set;
> +	rte_eth_get_wake_addr;
>  };
>  
>  INTERNAL {
> -- 
> 2.17.1


* [dpdk-dev] [PATCH v7 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                               ` (9 more replies)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics Anatoly Burakov
                             ` (8 subsequent siblings)
  9 siblings, 10 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Bruce Richardson, Konstantin Ananyev, david.hunt,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a new CPUID flag indicating processor support for UMONITOR/UMWAIT
and TPAUSE instructions.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v6:
    - Fix typos
    - Better commit message

 lib/librte_eal/x86/include/rte_cpuflags.h | 1 +
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..848ba9cbfb 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1
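
The FEAT_DEF entry in the patch above places WAITPKG at CPUID leaf 7, subleaf 0, ECX bit 5. As a rough standalone illustration of what that lookup amounts to, the sketch below queries the same bit through the compiler's <cpuid.h> helper. The demo_* names are hypothetical, and inside DPDK one would presumably call rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG) instead.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/clang helper for the CPUID instruction */
#endif

/* WAITPKG is reported in CPUID.(EAX=7,ECX=0):ECX bit 5 */
#define DEMO_WAITPKG_ECX_BIT 5

/* Decode the WAITPKG bit from a raw ECX value of leaf 7, subleaf 0. */
static bool
demo_ecx_has_waitpkg(uint32_t ecx)
{
	return ((ecx >> DEMO_WAITPKG_ECX_BIT) & 1u) != 0;
}

/* Query the running CPU; always false on non-x86 builds. */
static bool
demo_cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* returns 0 when leaf 7 is not available on this CPU */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return demo_ecx_has_waitpkg(ecx);
#else
	return false;
#endif
}
```

A caller would check demo_cpu_has_waitpkg() once at startup and fall back to plain polling when it returns false, since executing UMONITOR/UMWAIT/TPAUSE on a CPU without WAITPKG raises an invalid-opcode fault.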


* [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-15 12:06             ` Jerin Jacob
                               ` (3 more replies)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
                             ` (7 subsequent siblings)
  9 siblings, 4 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jan Viktorin, Ruifeng Wang, David Christensen,
	Bruce Richardson, Konstantin Ananyev, david.hunt, jerinjacobk,
	thomas, timothy.mcdaniel, gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

For more details, please refer to Intel(R) 64 and IA-32 Architectures
Software Developer's Manual, Volume 2.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
---

Notes:
    v7:
    - Fix code style and other nitpicks (Konstantin)
    v6:
    - Add spinlock-enabled version to allow pthread-wait-like
      constructs with umwait
    - Clarify comments
    - Added experimental tags to intrinsics
    - Added endianness support
    v5:
    - Removed return values
    - Simplified intrinsics and hardcoded C0.2 state
    - Added other arch stubs

 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  60 ++++++++
 .../include/generic/rte_power_intrinsics.h    | 111 ++++++++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  60 ++++++++
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 135 ++++++++++++++++++
 8 files changed, 370 insertions(+)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
index 73b750a18f..c6a9f70d73 100644
--- a/lib/librte_eal/arm/include/meson.build
+++ b/lib/librte_eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
 	'rte_pause_32.h',
 	'rte_pause_64.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch_32.h',
 	'rte_prefetch_64.h',
 	'rte_prefetch.h',
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a4a1bc1159
--- /dev/null
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_ARM_H_
+#define _RTE_POWER_INTRINSIC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..fb897d9060
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ */
+__rte_experimental
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This call will also unlock a spinlock on entering sleep, and lock it again
+ * on waking up the CPU.
+ *
+ * @param p
+ *   Address to monitor for changes.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ * @param lck
+ *   A spinlock that must be locked before entering the function, will be
+ *   unlocked while the CPU is sleeping, and will be locked again once the CPU
+ *   wakes up.
+ */
+__rte_experimental
+static inline void rte_power_monitor_sync(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz,
+		rte_spinlock_t *lck);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ */
+__rte_experimental
+static inline void rte_power_pause(const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
index ab4bd28092..0873b2aecb 100644
--- a/lib/librte_eal/ppc/include/meson.build
+++ b/lib/librte_eal/ppc/include/meson.build
@@ -10,6 +10,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch.h',
 	'rte_rwlock.h',
 	'rte_spinlock.h',
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..4ed03d521f
--- /dev/null
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_PPC_H_
+#define _RTE_POWER_INTRINSIC_PPC_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..f9b761d796
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_H_
+#define _RTE_POWER_INTRINSIC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses TPAUSE instruction and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_H_ */
-- 
2.17.1


* [dpdk-dev] [PATCH v7 03/10] eal: add intrinsics support check infrastructure
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-16  9:02             ` Ruifeng Wang
  2020-10-16 11:21             ` Kinsella, Ray
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 04/10] ethdev: add simple power management API Anatoly Burakov
                             ` (6 subsequent siblings)
  9 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Ray Kinsella,
	Neil Horman, Bruce Richardson, Konstantin Ananyev, david.hunt,
	liang.j.ma, jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

Currently, it is not possible to check support for intrinsics that
are platform-specific, cannot be abstracted in a generic way, or do not
have support on all architectures. The CPUID flags can be used to some
extent, but they are only defined for their platform, while intrinsics
will be available to all code as they are in generic headers.

This patch introduces infrastructure to check support for certain
platform-specific intrinsics, and adds checks for the IA power
management intrinsics based on UMWAIT/UMONITOR and TPAUSE.
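
As a rough illustration of the intended usage pattern, the sketch below
models the support structure and picks a wait strategy from it. It is
standalone so it builds without DPDK: struct cpu_intrinsics and
get_intrinsics_support() are stand-ins for struct rte_cpu_intrinsics and
rte_cpu_get_intrinsics_support(), while the has_waitpkg argument, the
wait_strategy enum and choose_wait_strategy() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct rte_cpu_intrinsics from this patch. */
struct cpu_intrinsics {
	uint32_t power_monitor : 1;
	uint32_t power_pause : 1;
};

/* Hypothetical wait strategies an application could choose between. */
enum wait_strategy {
	WAIT_MONITOR,   /* rte_power_monitor()-style sleep */
	WAIT_TPAUSE,    /* rte_power_pause()-style timed pause */
	WAIT_BUSY_POLL, /* portable fallback, no special instructions */
};

/*
 * Stand-in for rte_cpu_get_intrinsics_support(): zero the structure, then
 * set both bits when the (hypothetical) has_waitpkg argument is non-zero,
 * as the x86 implementation does for RTE_CPUFLAG_WAITPKG.
 */
static void
get_intrinsics_support(struct cpu_intrinsics *intr, int has_waitpkg)
{
	assert(intr != NULL);
	memset(intr, 0, sizeof(*intr));
	if (has_waitpkg) {
		intr->power_monitor = 1;
		intr->power_pause = 1;
	}
}

/* Pick a strategy without ever selecting an unsupported instruction. */
static enum wait_strategy
choose_wait_strategy(const struct cpu_intrinsics *intr)
{
	if (intr->power_monitor)
		return WAIT_MONITOR;
	if (intr->power_pause)
		return WAIT_TPAUSE;
	return WAIT_BUSY_POLL;
}
```

The point of querying once and caching the result is that the intrinsics
themselves cannot report errors: calling them on an unsupported CPU may
raise an illegal instruction fault.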

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---

Notes:
    v6:
    - Fix the comments

 lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
 .../include/generic/rte_power_intrinsics.h    | 12 +++++++++
 lib/librte_eal/ppc/rte_cpuflags.c             |  7 +++++
 lib/librte_eal/rte_eal_version.map            |  1 +
 lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
 6 files changed, 64 insertions(+)

diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
index 7b257b7873..e3a53bcece 100644
--- a/lib/librte_eal/arm/rte_cpuflags.c
+++ b/lib/librte_eal/arm/rte_cpuflags.c
@@ -151,3 +151,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
index 872f0ebe3e..28a5aecde8 100644
--- a/lib/librte_eal/include/generic/rte_cpuflags.h
+++ b/lib/librte_eal/include/generic/rte_cpuflags.h
@@ -13,6 +13,32 @@
 #include "rte_common.h"
 #include <errno.h>
 
+#include <rte_compat.h>
+
+/**
+ * Structure used to describe platform-specific intrinsics that may or may not
+ * be supported at runtime.
+ */
+struct rte_cpu_intrinsics {
+	uint32_t power_monitor : 1;
+	/**< indicates support for rte_power_monitor function */
+	uint32_t power_pause : 1;
+	/**< indicates support for rte_power_pause function */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Check CPU support for various intrinsics at runtime.
+ *
+ * @param intrinsics
+ *     Pointer to a structure to be filled.
+ */
+__rte_experimental
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
+
 /**
  * Enumeration of all CPU features supported
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index fb897d9060..03a326f076 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -32,6 +32,10 @@
  * checked against the expected value, and if they match, the entering of
  * optimized power state may be aborted.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes.
  * @param expected_value
@@ -69,6 +73,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * This call will also lock a spinlock on entering sleep, and release it on
  * waking up the CPU.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes.
  * @param expected_value
@@ -101,6 +109,10 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
index 3bb7563ce9..61db5c216d 100644
--- a/lib/librte_eal/ppc/rte_cpuflags.c
+++ b/lib/librte_eal/ppc/rte_cpuflags.c
@@ -8,6 +8,7 @@
 #include <elf.h>
 #include <fcntl.h>
 #include <assert.h>
+#include <string.h>
 #include <unistd.h>
 
 /* Symbolic values for the entries in the auxiliary table */
@@ -108,3 +109,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index a93dea9fe6..ed944f2bd4 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -400,6 +400,7 @@ EXPERIMENTAL {
 	# added in 20.11
 	__rte_eal_trace_generic_size_t;
 	rte_service_lcore_may_be_active;
+	rte_cpu_get_intrinsics_support;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 0325c4b93b..a96312ff7f 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -7,6 +7,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <stdint.h>
+#include <string.h>
 
 #include "rte_cpuid.h"
 
@@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+		intrinsics->power_monitor = 1;
+		intrinsics->power_pause = 1;
+	}
+}
-- 
2.17.1


* [dpdk-dev] [PATCH v7 04/10] ethdev: add simple power management API
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                             ` (2 preceding siblings ...)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 05/10] power: add PMD power management API and callback Anatoly Burakov
                             ` (5 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to allow getting address of next RX descriptor from the
PMD, as well as release notes information.
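
To make the contract of the new callback more concrete, below is a minimal
sketch of what a driver-side implementation might look like for a toy
descriptor ring. Everything prefixed toy_ or TOY_ is hypothetical (layout,
status bit, ring size); only the callback signature follows
eth_get_wake_addr_t from this patch. A real PMD would derive the address and
the expected value from its own descriptor format.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 4
#define TOY_DESC_DONE 0x1u /* hypothetical "descriptor done" status bit */

/* Hypothetical RX descriptor: the NIC sets TOY_DESC_DONE on write-back. */
struct toy_rx_desc {
	volatile uint32_t status;
	uint32_t length;
};

struct toy_rxq {
	struct toy_rx_desc ring[TOY_RING_SIZE];
	uint16_t next; /* index of the next descriptor to be filled */
};

/* Same shape as the eth_get_wake_addr_t callback added by this patch. */
static int
toy_get_wake_addr(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
{
	struct toy_rxq *q = rxq;

	if (q == NULL || tail_desc_addr == NULL || expected == NULL ||
			mask == NULL || data_sz == NULL)
		return -EINVAL;

	/* monitor the status word of the descriptor the NIC writes next */
	*tail_desc_addr = &q->ring[q->next].status;
	*expected = TOY_DESC_DONE;
	*mask = TOY_DESC_DONE;
	*data_sz = sizeof(uint32_t);
	return 0;
}

/* The check a caller would make before deciding to sleep (4-byte case). */
static int
toy_desc_ready(const volatile void *addr, uint64_t expected, uint64_t mask)
{
	assert(addr != NULL);
	const uint64_t cur = *(const volatile uint32_t *)addr;

	return (cur & mask) == expected;
}
```

The mask lets the caller ignore unrelated bits in the status word, so only
the "done" bit decides whether the queue already has data.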

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v7:
    - Fixed queue ID validation
    - Fixed documentation
    
    v6:
    - Rebase on top of latest main
    - Ensure the API checks queue ID (Konstantin)
    - Removed accidental inclusion of unrelated release notes
    v5:
    - Bring function format in line with other functions in the file
    - Ensure the API is supported by the driver before calling it (Konstantin)

 doc/guides/rel_notes/release_20_11.rst   |  8 ++++++-
 lib/librte_ethdev/rte_ethdev.c           | 23 +++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 28 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_version.map |  1 +
 5 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index c61d7fcf67..4c6a615ce9 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -71,7 +71,13 @@ New Features
 * **Added the FEC API, for a generic FEC query and config.**
 
   Added the FEC API which provides functions for query FEC capabilities and
-  current FEC mode from device. Also, API for configuring FEC mode is also provided.
+  current FEC mode from device. Also, API for configuring FEC mode is also
+  provided.
+
+* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()``
+  * add new eth_dev_ops ``get_wake_addr``
 
 * **Updated Broadcom bnxt driver.**
 
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 59beb8aec2..d972a3a656 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -4844,6 +4844,29 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask, data_sz));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 3a31f94367..008bc3cce0 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4119,6 +4119,34 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake up address for the receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information will be
+ *   retrieved.
+ * @param wake_addr
+ *   The pointer to the address which will be monitored.
+ * @param expected
+ *   The pointer to value to be expected when descriptor is set.
+ * @param mask
+ *   The pointer to comparison bitmask for the expected value.
+ * @param data_sz
+ *   The pointer to data size for the expected value and comparison bitmask.
+ *
+ * @return
+ *   - 0: Success.
+ *   -ENOTSUP: Operation not supported.
+ *   -EINVAL: Invalid parameters.
+ *   -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index 35cc4fb186..76b179de42 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -655,6 +655,32 @@ typedef int (*eth_fec_get_t)(struct rte_eth_dev *dev,
  */
 typedef int (*eth_fec_set_t)(struct rte_eth_dev *dev, uint32_t fec_capa);
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an RX queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to where the address will be stored.
+ * @param expected
+ *   Pointer to the value to be expected when descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @param data_sz
+ *   Data size for the expected value (can be 1, 2, 4, or 8 bytes)
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -801,6 +827,8 @@ struct eth_dev_ops {
 	/**< Get Forward Error Correction(FEC) mode. */
 	eth_fec_set_t fec_set;
 	/**< Set Forward Error Correction(FEC) mode. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get next RX queue ring entry address. */
 };
 
 /**
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index f8a0945812..6c2ea5996d 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -232,6 +232,7 @@ EXPERIMENTAL {
 	rte_eth_fec_get_capability;
 	rte_eth_fec_get;
 	rte_eth_fec_set;
+	rte_eth_get_wake_addr;
 };
 
 INTERNAL {
-- 
2.17.1


* [dpdk-dev] [PATCH v7 05/10] power: add PMD power management API and callback
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                             ` (3 preceding siblings ...)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 04/10] ethdev: add simple power management API Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-15 16:52             ` Ananyev, Konstantin
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 06/10] net/ixgbe: implement power management API Anatoly Burakov
                             ` (4 subsequent siblings)
  9 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman,
	konstantin.ananyev, jerinjacobk, bruce.richardson, thomas,
	timothy.mcdaniel, gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. Pause instruction

   Instead of moving the core into a deeper C-state, this method uses the
   pause instruction to avoid busy polling.

3. Frequency scaling

   Reuse the existing DPDK power library to scale core frequency up or down
   depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup
    
    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch
    
    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check

 doc/guides/rel_notes/release_20_11.rst |  11 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 320 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 +++++++
 lib/librte_power/rte_power_version.map |   4 +
 5 files changed, 430 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 4c6a615ce9..a814c67d54 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -204,6 +204,17 @@ New Features
 
    * Added support to update subport rate dynamically.
 
+* **Added PMD power management mechanism.**
+
+  Three new Ethernet PMD power management mechanisms are added through the
+  existing RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..344b12d5ff
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,320 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static inline void
+umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static inline void
+umwait_sleep(struct pmd_queue_cfg *q_conf, uint16_t port_id, uint16_t qidx)
+{
+	volatile void *target_addr;
+	uint64_t expected, mask;
+	uint8_t data_sz;
+	int ret;
+
+	/*
+	 * get wake up address for this RX queue, as well as expected value,
+	 * comparison mask, and data size.
+	 */
+	ret = rte_eth_get_wake_addr(port_id, qidx, &target_addr,
+			&expected, &mask, &data_sz);
+
+	/* this should always succeed as all checks have been done already */
+	if (unlikely(ret != 0))
+		return;
+
+	/*
+	 * take out a spinlock to prevent control plane from concurrently
+	 * modifying the wakeup data.
+	 */
+	rte_spinlock_lock(&q_conf->umwait_lock);
+
+	/* have we been disabled by control plane? */
+	if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		/* we're good to go */
+
+		/*
+		 * store the wakeup address so that control plane can trigger a
+		 * write to this address and wake us up.
+		 */
+		q_conf->wait_addr = target_addr;
+		/* -1ULL is maximum value for TSC */
+		rte_power_monitor_sync(target_addr, expected, mask, -1ULL,
+				data_sz, &q_conf->umwait_lock);
+		/* erase the address */
+		q_conf->wait_addr = NULL;
+	}
+	rte_spinlock_unlock(&q_conf->umwait_lock);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			umwait_sleep(q_conf, port_id, qidx);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+		/*
+		 * we need to disable early as there might be a callback
+		 * currently spinning on a lock.
+		 */
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the RX queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the RX queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1


* [dpdk-dev] [PATCH v7 06/10] net/ixgbe: implement power management API
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                             ` (4 preceding siblings ...)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 05/10] power: add PMD power management API and callback Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 07/10] net/i40e: " Anatoly Burakov
                             ` (3 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
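
For reviewers less familiar with the expected/mask/data_sz triple returned
by `get_wake_addr`: the sketch below shows one way a caller might use it to
decide whether the descriptor was already written back before going to
sleep. This is an illustration only, not code from this series. The helper
name `wake_cond_already_met` is hypothetical, and the sketch assumes a
little-endian host, so the byte-order conversion done in the driver is
omitted.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of this patch): returns 1 if the masked
 * value at `addr` already equals `expected`, meaning the write we would
 * wait for has already happened and sleeping should be skipped.
 * Returns 0 otherwise, including for a zero mask or an unsupported size.
 */
static int
wake_cond_already_met(const volatile void *addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	uint64_t cur;

	/* a zero mask means "do not compare", as in the intrinsics docs */
	if (mask == 0)
		return 0;

	/* read exactly data_sz bytes from the monitored location */
	switch (data_sz) {
	case 1:
		cur = *(const volatile uint8_t *)addr;
		break;
	case 2:
		cur = *(const volatile uint16_t *)addr;
		break;
	case 4:
		cur = *(const volatile uint32_t *)addr;
		break;
	case 8:
		cur = *(const volatile uint64_t *)addr;
		break;
	default:
		return 0;
	}

	return (cur & mask) == expected;
}
```

With the values this driver returns (DD bit as both expected value and
mask, data_sz of 4), the helper would report 1 as soon as the DD bit is set
in the status word, regardless of the other status bits.
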
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 0b98e210e7..30b3f416d4 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 29d385c062..b1d656d270 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	*data_sz = 4;
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 0b5589ef4d..6b9afec655 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


* [dpdk-dev] [PATCH v7 07/10] net/i40e: implement power management API
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                             ` (5 preceding siblings ...)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 06/10] net/ixgbe: implement power management API Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 08/10] net/ice: " Anatoly Burakov
                             ` (2 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 943cfe71dc..cab86f8ec9 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr	              = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index f2844d3f74..cdb1cd494b 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,32 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	*data_sz = 8;
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..5826cf1099 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


* [dpdk-dev] [PATCH v7 08/10] net/ice: implement power management API
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                             ` (6 preceding siblings ...)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 07/10] net/i40e: " Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 10/10] doc: update programmer's guide for power library Anatoly Burakov
  9 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d8ce09d28f..260de5dfd7 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr	              = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 79e6df11f4..7c0f963d96 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -25,6 +25,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	*data_sz = 2;
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..7eeb8d467e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1


* [dpdk-dev] [PATCH v7 09/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                             ` (7 preceding siblings ...)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 08/10] net/ice: " Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 10/10] doc: update programmer's guide for power library Anatoly Burakov
  9 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, konstantin.ananyev, jerinjacobk,
	bruce.richardson, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v6:
    - Fixed typos in documentation

 .../sample_app_ug/l3_forward_power_man.rst    | 13 ++++++
 examples/l3fwd-power/main.c                   | 41 ++++++++++++++++++-
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 0cc6f2e62e..2767fb91aa 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -459,3 +461,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` performs simple L3 forwarding and enables the power saving scheme on specific
+port/queue/lcore combinations. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index d0e6c9bd77..af64dd521f 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf("PMD power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,12 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				rte_power_pmd_mgmt_queue_enable(lcore_id,
+							portid, queueid,
+						RTE_POWER_MGMT_TYPE_SCALE);
+
+			}
 		}
 	}
 
@@ -2790,6 +2812,9 @@ main(int argc, char **argv)
 						SKIP_MASTER);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MASTER);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MASTER);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2812,6 +2837,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1


* [dpdk-dev] [PATCH v7 10/10] doc: update programmer's guide for power library
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                             ` (8 preceding siblings ...)
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2020-10-15 12:04           ` Anatoly Burakov
  9 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-15 12:04 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, konstantin.ananyev, jerinjacobk,
	bruce.richardson, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Update programmer's guide to document PMD power management usage.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v5:
    - Moved l3fwd-power update to the l3fwd-power-related commit
    - Some rewordings and clarifications
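
    Illustration (not part of the patch): the "empty poll count reaches a
    certain number" trigger described in the guide can be pictured with the
    sketch below. The names `poll_state`, `rx_poll_update` and the threshold
    value are assumptions made for this note and are not taken from the
    library code.

```c
#include <stdint.h>

/* assumed threshold, chosen for illustration only */
#define EMPTY_POLL_THRESHOLD 512

/* hypothetical per-queue state: number of consecutive empty polls */
struct poll_state {
	uint32_t empty_polls;
};

/*
 * Hypothetical RX callback body: called after every RX burst with the
 * number of packets received. Returns 1 when the caller should enter a
 * power-saving state, 0 otherwise.
 */
static int
rx_poll_update(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: reset the counter and stay awake */
		st->empty_polls = 0;
		return 0;
	}

	/* empty poll: count it and sleep only once past the threshold */
	st->empty_polls++;
	return st->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

    In this sketch the first 512 consecutive empty polls return 0, every
    further empty poll returns 1, and any non-empty poll resets the count,
    so a queue with steady traffic would never be put to sleep.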

 doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..38c64d31e4 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,45 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever the empty poll count reaches a certain number.
+
+  * UMWAIT/UMONITOR
+
+   This power saving scheme will put the CPU into an optimized power state and
+   use the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will use the `rte_pause` function to avoid busy
+   polling.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing power library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable a specific power scheme for a given queue/port/core.
+
+* **Queue Disable**: Disable the power scheme for a given queue/port/core.
+
 References
 ----------
 
@@ -200,3 +239,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
-- 
2.17.1


* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics Anatoly Burakov
@ 2020-10-15 12:06             ` Jerin Jacob
  2020-10-15 13:16             ` Ferruh Yigit
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-15 12:06 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dpdk-dev, Liang Ma, Jan Viktorin, Ruifeng Wang,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Thomas Monjalon, McDaniel, Timothy, Gage Eads,
	chris.macnamara

On Thu, Oct 15, 2020 at 5:34 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
>
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
>
> For more details, please refer to Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual, Volume 2.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>



> ---
>
> Notes:
>     v7:
>     - Fix code style and other nitpicks (Konstantin)
>     v6:
>     - Add spinlock-enabled version to allow pthread-wait-like
>       constructs with umwait
>     - Clarify comments
>     - Added experimental tags to intrinsics
>     - Added endianness support
>     v5:
>     - Removed return values
>     - Simplified intrinsics and hardcoded C0.2 state
>     - Added other arch stubs
>
>  lib/librte_eal/arm/include/meson.build        |   1 +
>  .../arm/include/rte_power_intrinsics.h        |  60 ++++++++
>  .../include/generic/rte_power_intrinsics.h    | 111 ++++++++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/ppc/include/meson.build        |   1 +
>  .../ppc/include/rte_power_intrinsics.h        |  60 ++++++++
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 135 ++++++++++++++++++
>  8 files changed, 370 insertions(+)
>  create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
>
> diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
> index 73b750a18f..c6a9f70d73 100644
> --- a/lib/librte_eal/arm/include/meson.build
> +++ b/lib/librte_eal/arm/include/meson.build
> @@ -20,6 +20,7 @@ arch_headers = files(
>         'rte_pause_32.h',
>         'rte_pause_64.h',
>         'rte_pause.h',
> +       'rte_power_intrinsics.h',
>         'rte_prefetch_32.h',
>         'rte_prefetch_64.h',
>         'rte_prefetch.h',
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..a4a1bc1159
> --- /dev/null
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_ARM_H_
> +#define _RTE_POWER_INTRINSIC_ARM_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +static inline void
> +rte_power_monitor(const volatile void *p, const uint64_t expected_value,
> +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> +               const uint8_t data_sz)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +       RTE_SET_USED(data_sz);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +static inline void
> +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
> +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> +               const uint8_t data_sz, rte_spinlock_t *lck)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +       RTE_SET_USED(lck);
> +       RTE_SET_USED(data_sz);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +static inline void
> +rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..fb897d9060
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,111 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +#include <rte_compat.h>
> +#include <rte_spinlock.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file defines APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, a certain TSC timestamp is reached, or other
> + * reasons cause the CPU to wake up.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + * @param data_sz
> + *   Data size (in bytes) that will be used to compare expected value with the
> + *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
> + *   to undefined result.
> + */
> +__rte_experimental
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp, const uint8_t data_sz);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, a certain TSC timestamp is reached, or other
> + * reasons cause the CPU to wake up.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This call will also unlock a spinlock on entering sleep, and lock it again
> + * on waking up the CPU.
> + *
> + * @param p
> + *   Address to monitor for changes.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + * @param data_sz
> + *   Data size (in bytes) that will be used to compare expected value with the
> + *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
> + *   to undefined result.
> + * @param lck
> + *   A spinlock that must be locked before entering the function, will be
> + *   unlocked while the CPU is sleeping, and will be locked again once the CPU
> + *   wakes up.
> + */
> +__rte_experimental
> +static inline void rte_power_monitor_sync(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp, const uint8_t data_sz,
> +               rte_spinlock_t *lck);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + */
> +__rte_experimental
> +static inline void rte_power_pause(const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index cd09027958..3a12e87e19 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -60,6 +60,7 @@ generic_headers = files(
>         'generic/rte_memcpy.h',
>         'generic/rte_pause.h',
>         'generic/rte_prefetch.h',
> +       'generic/rte_power_intrinsics.h',
>         'generic/rte_rwlock.h',
>         'generic/rte_spinlock.h',
>         'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
> index ab4bd28092..0873b2aecb 100644
> --- a/lib/librte_eal/ppc/include/meson.build
> +++ b/lib/librte_eal/ppc/include/meson.build
> @@ -10,6 +10,7 @@ arch_headers = files(
>         'rte_io.h',
>         'rte_memcpy.h',
>         'rte_pause.h',
> +       'rte_power_intrinsics.h',
>         'rte_prefetch.h',
>         'rte_rwlock.h',
>         'rte_spinlock.h',
> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..4ed03d521f
> --- /dev/null
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_PPC_H_
> +#define _RTE_POWER_INTRINSIC_PPC_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on PPC64.
> + */
> +static inline void
> +rte_power_monitor(const volatile void *p, const uint64_t expected_value,
> +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> +               const uint8_t data_sz)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +       RTE_SET_USED(data_sz);
> +}
> +
> +/**
> + * This function is not supported on PPC64.
> + */
> +static inline void
> +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
> +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> +               const uint8_t data_sz, rte_spinlock_t *lck)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +       RTE_SET_USED(lck);
> +       RTE_SET_USED(data_sz);
> +}
> +
> +/**
> + * This function is not supported on PPC64.
> + */
> +static inline void
> +rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>         'rte_io.h',
>         'rte_memcpy.h',
>         'rte_prefetch.h',
> +       'rte_power_intrinsics.h',
>         'rte_pause.h',
>         'rte_rtm.h',
>         'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..f9b761d796
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,135 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_H_
> +#define _RTE_POWER_INTRINSIC_X86_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +static inline uint64_t
> +__get_umwait_val(const volatile void *p, const uint8_t sz)
> +{
> +       switch (sz) {
> +       case sizeof(uint8_t):
> +               return *(const volatile uint8_t *)p;
> +       case sizeof(uint16_t):
> +               return *(const volatile uint16_t *)p;
> +       case sizeof(uint32_t):
> +               return *(const volatile uint32_t *)p;
> +       case sizeof(uint64_t):
> +               return *(const volatile uint64_t *)p;
> +       default:
> +               /* this is an intrinsic, so we can't have any error handling */
> +               RTE_ASSERT(0);
> +               return 0;
> +       }
> +}
> +
> +/**
> + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
> + * For more information about usage of these instructions, please refer to
> + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
> + */
> +static inline void
> +rte_power_monitor(const volatile void *p, const uint64_t expected_value,
> +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> +               const uint8_t data_sz)
> +{
> +       const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +       const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +       /*
> +        * we're using raw byte codes for now as only the newest compiler
> +        * versions support this instruction natively.
> +        */
> +
> +       /* set address for UMONITOR */
> +       asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +                       :
> +                       : "D"(p));
> +
> +       if (value_mask) {
> +               const uint64_t cur_value = __get_umwait_val(p, data_sz);
> +               const uint64_t masked = cur_value & value_mask;
> +
> +               /* if the masked value is already matching, abort */
> +               if (masked == expected_value)
> +                       return;
> +       }
> +       /* execute UMWAIT */
> +       asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
> +                       : /* ignore rflags */
> +                       : "D"(0), /* enter C0.2 */
> +                         "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +/**
> + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
> + * For more information about usage of these instructions, please refer to
> + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
> + */
> +static inline void
> +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
> +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> +               const uint8_t data_sz, rte_spinlock_t *lck)
> +{
> +       const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +       const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +       /*
> +        * we're using raw byte codes for now as only the newest compiler
> +        * versions support this instruction natively.
> +        */
> +
> +       /* set address for UMONITOR */
> +       asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +                       :
> +                       : "D"(p));
> +
> +       if (value_mask) {
> +               const uint64_t cur_value = __get_umwait_val(p, data_sz);
> +               const uint64_t masked = cur_value & value_mask;
> +
> +               /* if the masked value is already matching, abort */
> +               if (masked == expected_value)
> +                       return;
> +       }
> +       rte_spinlock_unlock(lck);
> +
> +       /* execute UMWAIT */
> +       asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
> +                       : /* ignore rflags */
> +                       : "D"(0), /* enter C0.2 */
> +                         "a"(tsc_l), "d"(tsc_h));
> +
> +       rte_spinlock_lock(lck);
> +}
> +
> +/**
> + * This function uses TPAUSE instruction and will enter C0.2 state. For more
> + * information about usage of this instruction, please refer to Intel(R) 64 and
> + * IA-32 Architectures Software Developer's Manual.
> + */
> +static inline void
> +rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +       const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +
> +       /* execute TPAUSE */
> +       asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
> +               : /* ignore rflags */
> +               : "D"(0), /* enter C0.2 */
> +                 "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_H_ */
> --
> 2.17.1
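
A minimal, portable sketch of the two pieces of plain C logic in the x86
implementation quoted above: the optional masked comparison that skips the
sleep, and the split of the 64-bit TSC deadline into the two 32-bit halves
passed to UMWAIT/TPAUSE. The helper names (read_sized, should_skip_sleep,
split_tsc) are made up for illustration and are not DPDK API; the volatile
reads and the inline assembly from the patch are deliberately left out.

```c
#include <stdint.h>
#include <string.h>

/* Read 1, 2, 4 or 8 bytes from p and widen to 64 bits (0 for other sizes).
 * Assumption: a plain memcpy read stands in for the patch's volatile read. */
static uint64_t read_sized(const void *p, uint8_t sz)
{
	switch (sz) {
	case 1: { uint8_t v;  memcpy(&v, p, sizeof(v)); return v; }
	case 2: { uint16_t v; memcpy(&v, p, sizeof(v)); return v; }
	case 4: { uint32_t v; memcpy(&v, p, sizeof(v)); return v; }
	case 8: { uint64_t v; memcpy(&v, p, sizeof(v)); return v; }
	default: return 0;
	}
}

/* Return 1 if the masked value already matches and the sleep should be
 * skipped; a zero mask disables the comparison entirely. */
static int should_skip_sleep(const void *p, uint64_t expected,
		uint64_t mask, uint8_t sz)
{
	if (mask == 0)
		return 0;
	return (read_sized(p, sz) & mask) == expected;
}

/* Split a 64-bit TSC deadline into low and high 32-bit halves. */
static void split_tsc(uint64_t tsc, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)tsc;
	*hi = (uint32_t)(tsc >> 32);
}
```

Note that the comparison is against the masked value, so the caller is
expected to pass an expected_value that already fits inside value_mask.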


* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics Anatoly Burakov
  2020-10-15 12:06             ` Jerin Jacob
@ 2020-10-15 13:16             ` Ferruh Yigit
  2020-10-16  8:44               ` Ruifeng Wang
  2020-10-15 16:43             ` Ananyev, Konstantin
  2020-10-19 21:12             ` Thomas Monjalon
  3 siblings, 1 reply; 421+ messages in thread
From: Ferruh Yigit @ 2020-10-15 13:16 UTC (permalink / raw)
  To: Anatoly Burakov, dev, Ruifeng Wang
  Cc: Liang Ma, Jan Viktorin, David Christensen, Bruce Richardson,
	Konstantin Ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, chris.macnamara

On 10/15/2020 1:04 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> For more details, please refer to Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual, Volume 2.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> ---
> 
> Notes:
>      v7:
>      - Fix code style and other nitpicks (Konstantin)
>      v6:
>      - Add spinlock-enabled version to allow pthread-wait-like
>        constructs with umwait
>      - Clarify comments
>      - Added experimental tags to intrinsics
>      - Added endianness support
>      v5:
>      - Removed return values
>      - Simplified intrinsics and hardcoded C0.2 state
>      - Added other arch stubs
> 

Hi Ruifeng,

This is the patch we talked about in today's release status meeting, can you 
please check the patch from Arm perspective?
Since the instructions are not supported by Arm I expect it should be OK but it 
would be good to get your ack to proceed.

Thanks,
ferruh


* Re: [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback
  2020-10-15 10:31             ` Burakov, Anatoly
@ 2020-10-15 16:02               ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-15 16:02 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, jerinjacobk,
	Richardson, Bruce, thomas, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris

> On 14-Oct-20 7:41 PM, Ananyev, Konstantin wrote:
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Add a simple on/off switch that will enable saving power when no
> >> packets are arriving. It is based on counting the number of empty
> >> polls and, when the number reaches a certain threshold, entering an
> >> architecture-defined optimized power state that will either wait
> >> until a TSC timestamp expires, or when packets arrive.
> >>
> >> This API mandates a core-to-single-queue mapping (that is, multiple
> >> queues per device are supported, but they have to be polled on different
> >> cores).
> >>
> >> This design is using PMD RX callbacks.
> >>
> >> 1. UMWAIT/UMONITOR:
> >>
> >>     When a certain threshold of empty polls is reached, the core will go
> >>     into a power optimized sleep while waiting on an address of next RX
> >>     descriptor to be written to.
> >>
> >> 2. Pause instruction
> >>
> >>     Instead of moving the core into a deeper C state, this method uses the
> >>     pause instruction to avoid busy polling.
> >>
> >> 3. Frequency scaling
> >>     Reuse existing DPDK power library to scale up/down core frequency
> >>     depending on traffic volume.
> >>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>
> >> Notes:
> >>      v6:
> >>      - Added wakeup mechanism for UMWAIT
> >>      - Removed memory allocation (everything is now allocated statically)
> >>      - Fixed various typos and comments
> >>      - Check for invalid queue ID
> >>      - Moved release notes to this patch
> >>
> >>      v5:
> >>      - Make error checking more robust
> >>        - Prevent initializing scaling if ACPI or PSTATE env wasn't set
> >>        - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
> >>      - Add some debug logging
> >>      - Replace x86-specific code path to generic path using the intrinsic check
> >>
> 
> 
> <snip>
> 
> >
> >
> > I think you need to check state here, and _disable() has to set the state with the lock grabbed.
> > Otherwise this lock wouldn't protect you from race conditions.
> > As an example:
> >
> > CP@T0:
> > rte_spinlock_lock(&queue_cfg->umwait_lock);
> > if (queue_cfg->wait_addr != NULL) //wait_addr == NULL, fallthrough
> > rte_spinlock_unlock(&queue_cfg->umwait_lock);
> >
> > DP@T1:
> > rte_spinlock_lock(&queue_cfg->umwait_lock);
> > queue_cfg->wait_addr = target_addr;
> > monitor_sync(...);  // DP was put to sleep
> >
> > CP@T2:
> > queue_cfg->cur_cb = NULL;
> > queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> > ret = 0;
> >
> > rte_power_pmd_mgmt_queue_disable() finished with success,
> > but the DP core wasn't woken up.
> >
> > To be more specific:
> > clb_umwait(...) {
> > ...
> > lock(&qcfg->lck);
> > if (qcfg->state == ENABLED)  {
> > qcfg->wake_addr = addr;
> > monitor_sync(addr, ...,&qcfg->lck);
> > }
> > unlock(&qcfg->lck);
> > ...
> > }
> >
> > _disable(...) {
> > ...
> > lock(&qcfg->lck);
> > qcfg->state = DISABLED;
> > if (qcfg->wake_addr != NULL)
> > monitor_wakeup(qcfg->wake_addr);
> > unlock(&qcfg->lck);
> > ...
> > }
> 
> True, didn't think of that. Will fix.
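
A minimal sketch of the decision logic in the suggestion quoted above: the
data path arms a sleep only if the state is still ENABLED when checked under
the lock, and disable changes the state under the same lock before deciding
whether a wakeup is needed. All names here (q_cfg, dp_try_sleep, cp_disable)
are hypothetical, the toy lock built on atomic_flag is only a stand-in for
rte_spinlock_t, and the actual sleep and the wakeup write are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum q_state { Q_DISABLED = 0, Q_ENABLED };

struct q_cfg {
	atomic_flag lck;       /* toy spinlock, stand-in for rte_spinlock_t */
	enum q_state state;
	const void *wake_addr; /* set once the data path has armed a sleep */
};

static void q_lock(struct q_cfg *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lck, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void q_unlock(struct q_cfg *q)
{
	atomic_flag_clear_explicit(&q->lck, memory_order_release);
}

static void q_init(struct q_cfg *q)
{
	atomic_flag_clear(&q->lck);
	q->state = Q_ENABLED;
	q->wake_addr = NULL;
}

/* Data path: returns true if it would go to sleep on addr. */
static bool dp_try_sleep(struct q_cfg *q, const void *addr)
{
	bool sleeping = false;

	q_lock(q);
	if (q->state == Q_ENABLED) {
		q->wake_addr = addr;
		sleeping = true;
		/* real code would call the lock-aware monitor intrinsic here */
	}
	q_unlock(q);
	return sleeping;
}

/* Control path: returns true if a wakeup write to wake_addr is needed. */
static bool cp_disable(struct q_cfg *q)
{
	bool need_wake;

	q_lock(q);
	q->state = Q_DISABLED;
	need_wake = (q->wake_addr != NULL);
	q_unlock(q);
	return need_wake;
}
```

Because both sides touch the state only with the lock held, a disable that
completes first guarantees the data path will not arm a new sleep, and a
disable that runs second sees the armed address and knows to wake it.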
> 
> >> +
> >> +if (!i.power_monitor) {
> >> +RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> >> +ret = -ENOTSUP;
> >> +goto end;
> >> +}
> >> +
> >> +/* check if the device supports the necessary PMD API */
> >> +if (rte_eth_get_wake_addr(port_id, queue_id,
> >> +&dummy_addr, &dummy_expected,
> >> +&dummy_mask, &dummy_sz) == -ENOTSUP) {
> >> +RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
> >> +ret = -ENOTSUP;
> >> +goto end;
> >> +}
> >> +/* initialize UMWAIT spinlock */
> >> +rte_spinlock_init(&queue_cfg->umwait_lock);
> >
> > I think you don't need to do that.
> > It is supposed to be in a valid state (otherwise you are probably in trouble anyway).
> 
> This is mostly for initialization, for when we first run the callback. I
> suppose we could do it in an RTE_INIT() function or just leave it be
> since the spinlocks are part of a statically allocated structure and
> will default to 0 anyway

Yes, static initialization should be sufficient here.

> (although it wouldn't be proper usage of the
> API as that would be relying on implementation detail).

What I am trying to say - at that moment spinlock value should be zero.
If it is not, then your DP callback is still running and holding the lock.
I understand that it should be a *very* rare situation, but it seems 
strange to me to silently over-write the locked value from other thread.
It shouldn't cause any major trouble, but still...
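
A small sketch of the concern, using a toy lock rather than rte_spinlock_t
(the type and helper names below are made up for illustration). A lock in
zero-initialized static storage already starts out unlocked, while calling
the init routine on a lock that another thread holds silently discards that
ownership:

```c
#include <stdatomic.h>

/* Toy lock: 0 = unlocked, 1 = locked. Stand-in only, not DPDK code. */
typedef struct {
	atomic_int locked;
} toy_spinlock_t;

static void toy_init(toy_spinlock_t *sl)
{
	atomic_store(&sl->locked, 0);
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
static int toy_trylock(toy_spinlock_t *sl)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&sl->locked, &expected, 1);
}

static int toy_is_locked(toy_spinlock_t *sl)
{
	return atomic_load(&sl->locked) != 0;
}

/* Static storage is zero-initialized, so this lock starts unlocked. */
static toy_spinlock_t static_lock;
```

If one caller takes static_lock and toy_init() is then called on it, a second
toy_trylock() succeeds even though the first holder never released the lock.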

> 
> >
> >> +
> >> +/* initialize data before enabling the callback */
> >> +queue_cfg->empty_poll_stats = 0;
> >> +queue_cfg->cb_mode = mode;
> >> +queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> >> +
> >> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +clb_umwait, NULL);
> >
> > Would be a bit cleaner/nicer to move add_rx_callback out of switch() {}
> > As you have to do it always anyway.
> > Same thought for disable() and remove_rx_callback().
> 
> The functions are different for each, so we can't move them out of
> switch (unless you're suggesting to unify the callback to handle all
> three modes?).
> 
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics Anatoly Burakov
  2020-10-15 12:06             ` Jerin Jacob
  2020-10-15 13:16             ` Ferruh Yigit
@ 2020-10-15 16:43             ` Ananyev, Konstantin
  2020-10-19 21:12             ` Thomas Monjalon
  3 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-15 16:43 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Jan Viktorin, Ruifeng Wang, David Christensen,
	Richardson, Bruce, Hunt, David, jerinjacobk, thomas, McDaniel,
	Timothy, Eads, Gage, Macnamara, Chris


> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> For more details, please refer to Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual, Volume 2.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> ---
> 

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1


* Re: [dpdk-dev] [PATCH v7 05/10] power: add PMD power management API and callback
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 05/10] power: add PMD power management API and callback Anatoly Burakov
@ 2020-10-15 16:52             ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-15 16:52 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, jerinjacobk,
	Richardson, Bruce, thomas, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris



> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API mandates a core-to-single-queue mapping (that is, multiple
> queues per device are supported, but they have to be polled on different
> cores).
> 
> This design is using PMD RX callbacks.
> 
> 1. UMWAIT/UMONITOR:
> 
>    When a certain threshold of empty polls is reached, the core will go
>    into a power optimized sleep while waiting on an address of next RX
>    descriptor to be written to.
> 
> 2. Pause instruction
> 
>    Instead of moving the core into a deeper C state, this method uses the
>    pause instruction to avoid busy polling.
> 
> 3. Frequency scaling
>    Reuse existing DPDK power library to scale up/down core frequency
>    depending on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Hunt <david.hunt@intel.com>
> ---
> 
> Notes:
>     v7:
>     - Fixed race condition (Konstantin)
>     - Slight rework of the structure of monitor code
>     - Added missing inline for wakeup
> 
>     v6:
>     - Added wakeup mechanism for UMWAIT
>     - Removed memory allocation (everything is now allocated statically)
>     - Fixed various typos and comments
>     - Check for invalid queue ID
>     - Moved release notes to this patch
> 
>     v5:
>     - Make error checking more robust
>       - Prevent initializing scaling if ACPI or PSTATE env wasn't set
>       - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
>     - Add some debug logging
>     - Replace x86-specific code path to generic path using the intrinsic check
> 
>  doc/guides/rel_notes/release_20_11.rst |  11 +
>  lib/librte_power/meson.build           |   5 +-
>  lib/librte_power/rte_power_pmd_mgmt.c  | 320 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_pmd_mgmt.h  |  92 +++++++
>  lib/librte_power/rte_power_version.map |   4 +
>  5 files changed, 430 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> 
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index 4c6a615ce9..a814c67d54 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -204,6 +204,17 @@ New Features
> 
>     * Added support to update subport rate dynamically.
> 
> +* **Add PMD power management mechanism**
> +
> +  Three new Ethernet PMD power management mechanisms are added through the existing
> +  RX callback infrastructure.
> +
> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
> +  * Add power saving scheme based on ``rte_pause()``
> +  * Add power saving scheme based on frequency scaling through the power library
> +  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
> +  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
> +
> 
>  Removed Items
>  -------------

....
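
The empty-poll counting described in the commit message quoted above can be
sketched roughly as follows. This is only an illustration under assumptions:
the struct, the function name on_rx_burst and the threshold value are made
up, and what the callback does once the threshold is hit (UMWAIT, pause or
frequency scaling) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue bookkeeping; not the structure from the patch. */
struct poll_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
	uint64_t threshold;   /* assumed tunable; the value is illustrative */
};

/* Called after every RX burst with the number of packets received.
 * Returns true when the caller should enter a power-optimized state. */
static bool on_rx_burst(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: reset the counter and stay awake */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls >= st->threshold;
}
```

Resetting the counter on any non-empty burst is what keeps the core from
sleeping while traffic is still flowing.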

> +
> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id,
> +		enum rte_power_pmd_mgmt_type mode)
> +{
> +	struct rte_eth_dev *dev;
> +	struct pmd_queue_cfg *queue_cfg;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/* check if queue id is valid */
> +	if (queue_id >= dev->data->nb_rx_queues ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		return -EINVAL;
> +	}
> +
> +	queue_cfg = &port_cfg[port_id][queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> +		ret = -EINVAL;
> +		goto end;
> +	}
> +
> +	switch (mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:
> +	{
> +		/* check if rte_power_monitor is supported */
> +		uint64_t dummy_expected, dummy_mask;
> +		struct rte_cpu_intrinsics i;
> +		volatile void *dummy_addr;
> +		uint8_t dummy_sz;
> +
> +		rte_cpu_get_intrinsics_support(&i);
> +
> +		if (!i.power_monitor) {
> +			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +
> +		/* check if the device supports the necessary PMD API */
> +		if (rte_eth_get_wake_addr(port_id, queue_id,
> +				&dummy_addr, &dummy_expected,
> +				&dummy_mask, &dummy_sz) == -ENOTSUP) {
> +			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* initialize UMWAIT spinlock */
> +		rte_spinlock_init(&queue_cfg->umwait_lock);

Still looks excessive and possibly error prone to me.
Apart from that:
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> +
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +				clb_umwait, NULL);
> +		break;
> +	}
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +	{
> +		enum power_management_env env;
> +		/* only PSTATE and ACPI modes are supported */
> +		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> +				!rte_power_check_env_supported(
> +					PM_ENV_PSTATE_CPUFREQ)) {
> +			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* ensure we could initialize the power library */
> +		if (rte_power_init(lcore_id)) {
> +			ret = -EINVAL;
> +			goto end;
> +		}
> +		/* ensure we initialized the correct env */
> +		env = rte_power_get_env();
> +		if (env != PM_ENV_ACPI_CPUFREQ &&
> +				env != PM_ENV_PSTATE_CPUFREQ) {
> +			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
> +				queue_id, clb_scale_freq, NULL);
> +		break;
> +	}
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +				clb_pause, NULL);
> +		break;
> +	}
> +	ret = 0;
> +
> +end:
> +	return ret;
> +}
> +


* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-15 13:16             ` Ferruh Yigit
@ 2020-10-16  8:44               ` Ruifeng Wang
  0 siblings, 0 replies; 421+ messages in thread
From: Ruifeng Wang @ 2020-10-16  8:44 UTC (permalink / raw)
  To: Ferruh Yigit, Anatoly Burakov, dev
  Cc: Liang Ma, Jan Viktorin, David Christensen, Bruce Richardson,
	Konstantin Ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, chris.macnamara, nd


> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Thursday, October 15, 2020 9:17 PM
> To: Anatoly Burakov <anatoly.burakov@intel.com>; dev@dpdk.org; Ruifeng
> Wang <Ruifeng.Wang@arm.com>
> Cc: Liang Ma <liang.j.ma@intel.com>; Jan Viktorin
> <viktorin@rehivetech.com>; David Christensen <drc@linux.vnet.ibm.com>;
> Bruce Richardson <bruce.richardson@intel.com>; Konstantin Ananyev
> <konstantin.ananyev@intel.com>; david.hunt@intel.com;
> jerinjacobk@gmail.com; thomas@monjalon.net;
> timothy.mcdaniel@intel.com; gage.eads@intel.com;
> chris.macnamara@intel.com
> Subject: Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management
> intrinsics
> 
> On 10/15/2020 1:04 PM, Anatoly Burakov wrote:
> > From: Liang Ma <liang.j.ma@intel.com>
> >
> > Add two new power management intrinsics, and provide an
> implementation
> > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions are
> > implemented as raw byte opcodes because there is not yet widespread
> > compiler support for these instructions.
> >
> > The power management instructions provide an architecture-specific
> > function to either wait until a specified TSC timestamp is reached, or
> > optionally wait until either a TSC timestamp is reached or a memory
> > location is written to. The monitor function also provides an optional
> > comparison, to avoid sleeping when the expected write has already
> > happened, and no more writes are expected.
> >
> > For more details, please refer to Intel(R) 64 and IA-32 Architectures
> > Software Developer's Manual, Volume 2.
> >
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> > ---
> >
> > Notes:
> >      v7:
> >      - Fix code style and other nitpicks (Konstantin)
> >      v6:
> >      - Add spinlock-enabled version to allow pthread-wait-like
> >        constructs with umwait
> >      - Clarify comments
> >      - Added experimental tags to intrinsics
> >      - Added endianness support
> >      v5:
> >      - Removed return values
> >      - Simplified intrinsics and hardcoded C0.2 state
> >      - Added other arch stubs
> >
> 
> Hi Ruifeng,
> 
> This is the patch we have talked in today's release status meeting, can you
> please check the patch from Arm perspective?
> Since the instructions are not supported by Arm I expect it should be OK but
> it would be good to get your ack to proceed.
> 
Thanks for pointing me to this.
Generally looks good to me.

Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Thanks,
> ferruh


* Re: [dpdk-dev] [PATCH v7 03/10] eal: add intrinsics support check infrastructure
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
@ 2020-10-16  9:02             ` Ruifeng Wang
  2020-10-16 11:21             ` Kinsella, Ray
  1 sibling, 0 replies; 421+ messages in thread
From: Ruifeng Wang @ 2020-10-16  9:02 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman,
	Bruce Richardson, Konstantin Ananyev, david.hunt, liang.j.ma,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara, nd


> -----Original Message-----
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> Sent: Thursday, October 15, 2020 8:04 PM
> To: dev@dpdk.org
> Cc: Jan Viktorin <viktorin@rehivetech.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; David Christensen <drc@linux.vnet.ibm.com>;
> Ray Kinsella <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>;
> Bruce Richardson <bruce.richardson@intel.com>; Konstantin Ananyev
> <konstantin.ananyev@intel.com>; david.hunt@intel.com;
> liang.j.ma@intel.com; jerinjacobk@gmail.com; thomas@monjalon.net;
> timothy.mcdaniel@intel.com; gage.eads@intel.com;
> chris.macnamara@intel.com
> Subject: [PATCH v7 03/10] eal: add intrinsics support check infrastructure
> 
> Currently, it is not possible to check support for intrinsics that are platform-
> specific, cannot be abstracted in a generic way, or do not have support on all
> architectures. The CPUID flags can be used to some extent, but they are only
> defined for their platform, while intrinsics will be available to all code as they
> are in generic headers.
> 
> This patch introduces infrastructure to check support for certain platform-
> specific intrinsics, and adds support for checking support for IA power
> management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
> 
> Notes:
>     v6:
>     - Fix the comments
> 
>  lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
>  lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
>  .../include/generic/rte_power_intrinsics.h    | 12 +++++++++
>  lib/librte_eal/ppc/rte_cpuflags.c             |  7 +++++
>  lib/librte_eal/rte_eal_version.map            |  1 +
>  lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
>  6 files changed, 64 insertions(+)
> 
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
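
For reference, a rough sketch of how a caller might use such a support check
before relying on the monitor intrinsic. The struct below is a hypothetical
stand-in for struct rte_cpu_intrinsics: only the power_monitor field appears
elsewhere in this thread, and power_pause is an assumed name. The "report
nothing" helper mimics a stub that simply zeroes the structure.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rte_cpu_intrinsics. */
struct cpu_intrinsics {
	uint32_t power_monitor : 1; /* monitor/wait style intrinsic usable */
	uint32_t power_pause : 1;   /* assumed field name for timed pause */
};

/* Mimics a stub for an architecture without these instructions. */
static void get_intrinsics_unsupported(struct cpu_intrinsics *i)
{
	memset(i, 0, sizeof(*i));
}

/* Returns 0 if the monitor intrinsic may be used, -ENOTSUP otherwise. */
static int check_monitor_supported(const struct cpu_intrinsics *i)
{
	return i->power_monitor ? 0 : -ENOTSUP;
}
```

With a zeroed structure every capability reads as unsupported, so generic
code can call the check on any architecture and fall back cleanly.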


* Re: [dpdk-dev] [PATCH v7 03/10] eal: add intrinsics support check infrastructure
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
  2020-10-16  9:02             ` Ruifeng Wang
@ 2020-10-16 11:21             ` Kinsella, Ray
  1 sibling, 0 replies; 421+ messages in thread
From: Kinsella, Ray @ 2020-10-16 11:21 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Neil Horman,
	Bruce Richardson, Konstantin Ananyev, david.hunt, liang.j.ma,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara



On 15/10/2020 13:04, Anatoly Burakov wrote:
> Currently, it is not possible to check support for intrinsics that
> are platform-specific, cannot be abstracted in a generic way, or do not
> have support on all architectures. The CPUID flags can be used to some
> extent, but they are only defined for their platform, while intrinsics
> will be available to all code as they are in generic headers.
> 
> This patch introduces infrastructure to check support for certain
> platform-specific intrinsics, and adds support for checking support for
> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
> 
> Notes:
>     v6:
>     - Fix the comments
> 
>  lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
>  lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
>  .../include/generic/rte_power_intrinsics.h    | 12 +++++++++
>  lib/librte_eal/ppc/rte_cpuflags.c             |  7 +++++
>  lib/librte_eal/rte_eal_version.map            |  1 +
>  lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
>  6 files changed, 64 insertions(+)

Acked-by: Ray Kinsella <mdr@ashroe.eu>

> diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
> index 7b257b7873..e3a53bcece 100644
> --- a/lib/librte_eal/arm/rte_cpuflags.c
> +++ b/lib/librte_eal/arm/rte_cpuflags.c
> @@ -151,3 +151,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>  		return NULL;
>  	return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +	memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
> index 872f0ebe3e..28a5aecde8 100644
> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
> @@ -13,6 +13,32 @@
>  #include "rte_common.h"
>  #include <errno.h>
>  
> +#include <rte_compat.h>
> +
> +/**
> + * Structure used to describe platform-specific intrinsics that may or may not
> + * be supported at runtime.
> + */
> +struct rte_cpu_intrinsics {
> +	uint32_t power_monitor : 1;
> +	/**< indicates support for rte_power_monitor function */
> +	uint32_t power_pause : 1;
> +	/**< indicates support for rte_power_pause function */
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Check CPU support for various intrinsics at runtime.
> + *
> + * @param intrinsics
> + *     Pointer to a structure to be filled.
> + */
> +__rte_experimental
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
> +
>  /**
>   * Enumeration of all CPU features supported
>   */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> index fb897d9060..03a326f076 100644
> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -32,6 +32,10 @@
>   * checked against the expected value, and if they match, the entering of
>   * optimized power state may be aborted.
>   *
> + * @warning It is the responsibility of the user to check if this function is
> + *   supported at runtime using the `rte_cpu_get_intrinsics_support()` API
> + *   call. Failing to do so may result in an illegal CPU instruction error.
> + *
>   * @param p
>   *   Address to monitor for changes.
>   * @param expected_value
> @@ -69,6 +73,10 @@ static inline void rte_power_monitor(const volatile void *p,
>   * This call will also lock a spinlock on entering sleep, and release it on
>   * waking up the CPU.
>   *
> + * @warning It is the responsibility of the user to check if this function is
> + *   supported at runtime using the `rte_cpu_get_intrinsics_support()` API
> + *   call. Failing to do so may result in an illegal CPU instruction error.
> + *
>   * @param p
>   *   Address to monitor for changes.
>   * @param expected_value
> @@ -101,6 +109,10 @@ static inline void rte_power_monitor_sync(const volatile void *p,
>   * Enter an architecture-defined optimized power state until a certain TSC
>   * timestamp is reached.
>   *
> + * @warning It is the responsibility of the user to check if this function is
> + *   supported at runtime using the `rte_cpu_get_intrinsics_support()` API
> + *   call. Failing to do so may result in an illegal CPU instruction error.
> + *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for. Note that the wait behavior is
>   *   architecture-dependent.
> diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
> index 3bb7563ce9..61db5c216d 100644
> --- a/lib/librte_eal/ppc/rte_cpuflags.c
> +++ b/lib/librte_eal/ppc/rte_cpuflags.c
> @@ -8,6 +8,7 @@
>  #include <elf.h>
>  #include <fcntl.h>
>  #include <assert.h>
> +#include <string.h>
>  #include <unistd.h>
>  
>  /* Symbolic values for the entries in the auxiliary table */
> @@ -108,3 +109,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>  		return NULL;
>  	return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +	memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index a93dea9fe6..ed944f2bd4 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -400,6 +400,7 @@ EXPERIMENTAL {
>  	# added in 20.11
>  	__rte_eal_trace_generic_size_t;
>  	rte_service_lcore_may_be_active;
> +	rte_cpu_get_intrinsics_support;
>  };
>  
>  INTERNAL {
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 0325c4b93b..a96312ff7f 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -7,6 +7,7 @@
>  #include <stdio.h>
>  #include <errno.h>
>  #include <stdint.h>
> +#include <string.h>
>  
>  #include "rte_cpuid.h"
>  
> @@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>  		return NULL;
>  	return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +	memset(intrinsics, 0, sizeof(*intrinsics));
> +
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> +		intrinsics->power_monitor = 1;
> +		intrinsics->power_pause = 1;
> +	}
> +}
> 
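
Editor's note: the check-then-call pattern this patch asks of callers can
be sketched as below. This is a minimal, self-contained sketch and not DPDK
code: `struct cpu_intrinsics`, `get_intrinsics_support()` and
`can_use_power_monitor()` are hypothetical stand-ins for the patch's
`struct rte_cpu_intrinsics` and `rte_cpu_get_intrinsics_support()`, and the
WAITPKG flag is passed in as a plain parameter instead of being read from
the CPU.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rte_cpu_intrinsics from the patch. */
struct cpu_intrinsics {
	uint32_t power_monitor : 1; /* monitor-style wait is usable */
	uint32_t power_pause : 1;   /* timed pause is usable */
};

/*
 * Hypothetical stand-in for rte_cpu_get_intrinsics_support(): clear the
 * structure, then set both bits only if the platform reports WAITPKG.
 * The flag is a parameter here to keep the sketch self-contained.
 */
static void
get_intrinsics_support(struct cpu_intrinsics *intr, int has_waitpkg)
{
	memset(intr, 0, sizeof(*intr));
	if (has_waitpkg) {
		intr->power_monitor = 1;
		intr->power_pause = 1;
	}
}

/* Caller-side gate: decide once at init whether the sleep path is allowed. */
static int
can_use_power_monitor(int has_waitpkg)
{
	struct cpu_intrinsics intr;

	get_intrinsics_support(&intr, has_waitpkg);
	return intr.power_monitor;
}
```

An application would presumably run such a check once at startup and fall
back to plain busy polling when the bit is clear, since the ARM and PPC
versions in the patch always report no support.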


* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics Anatoly Burakov
                               ` (2 preceding siblings ...)
  2020-10-15 16:43             ` Ananyev, Konstantin
@ 2020-10-19 21:12             ` Thomas Monjalon
  2020-10-20  2:49               ` Ruifeng Wang
  3 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-19 21:12 UTC (permalink / raw)
  To: Ruifeng Wang, honnappa.nagarahalli
  Cc: dev, Liang Ma, Jan Viktorin, David Christensen, Bruce Richardson,
	Konstantin Ananyev, david.hunt, jerinjacobk, timothy.mcdaniel,
	gage.eads, chris.macnamara, Anatoly Burakov, david.marchand

15/10/2020 14:04, Anatoly Burakov:
> +/**
> + * This function is not supported on ARM.
> + */
> +static inline void
> +rte_power_monitor(const volatile void *p, const uint64_t expected_value,
> +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> +               const uint8_t data_sz)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +       RTE_SET_USED(data_sz);
> +}

Are you sure it cannot be partially supported with WFE instruction?




* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-19 21:12             ` Thomas Monjalon
@ 2020-10-20  2:49               ` Ruifeng Wang
  2020-10-20  7:35                 ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Ruifeng Wang @ 2020-10-20  2:49 UTC (permalink / raw)
  To: thomas, Honnappa Nagarahalli
  Cc: dev, Liang Ma, Jan Viktorin, David Christensen, Bruce Richardson,
	Konstantin Ananyev, david.hunt, jerinjacobk, timothy.mcdaniel,
	gage.eads, chris.macnamara, Anatoly Burakov, david.marchand, nd


> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Tuesday, October 20, 2020 5:13 AM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Cc: dev@dpdk.org; Liang Ma <liang.j.ma@intel.com>; Jan Viktorin
> <viktorin@rehivetech.com>; David Christensen <drc@linux.vnet.ibm.com>;
> Bruce Richardson <bruce.richardson@intel.com>; Konstantin Ananyev
> <konstantin.ananyev@intel.com>; david.hunt@intel.com;
> jerinjacobk@gmail.com; timothy.mcdaniel@intel.com; gage.eads@intel.com;
> chris.macnamara@intel.com; Anatoly Burakov <anatoly.burakov@intel.com>;
> david.marchand@redhat.com
> Subject: Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management
> intrinsics
> 
> 15/10/2020 14:04, Anatoly Burakov:
> > +/**
> > + * This function is not supported on ARM.
> > + */
> > +static inline void
> > +rte_power_monitor(const volatile void *p, const uint64_t
> expected_value,
> > +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> > +               const uint8_t data_sz) {
> > +       RTE_SET_USED(p);
> > +       RTE_SET_USED(expected_value);
> > +       RTE_SET_USED(value_mask);
> > +       RTE_SET_USED(tsc_timestamp);
> > +       RTE_SET_USED(data_sz);
> > +}
> 
> Are you sure it cannot be partially supported with WFE instruction?
> 
The Armv8 WFE instruction can monitor a specific address for changes,
but it cannot wait on a TSC timestamp.


* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-20  2:49               ` Ruifeng Wang
@ 2020-10-20  7:35                 ` Thomas Monjalon
  2020-10-20 14:01                   ` David Hunt
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-20  7:35 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Ruifeng Wang
  Cc: dev, Liang Ma, Jan Viktorin, David Christensen, Bruce Richardson,
	Konstantin Ananyev, david.hunt, jerinjacobk, timothy.mcdaniel,
	gage.eads, chris.macnamara, Anatoly Burakov, david.marchand, nd,
	David Christensen

20/10/2020 04:49, Ruifeng Wang:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 15/10/2020 14:04, Anatoly Burakov:
> > > +/**
> > > + * This function is not supported on ARM.
> > > + */
> > > +static inline void
> > > +rte_power_monitor(const volatile void *p, const uint64_t
> > expected_value,
> > > +               const uint64_t value_mask, const uint64_t tsc_timestamp,
> > > +               const uint8_t data_sz) {
> > > +       RTE_SET_USED(p);
> > > +       RTE_SET_USED(expected_value);
> > > +       RTE_SET_USED(value_mask);
> > > +       RTE_SET_USED(tsc_timestamp);
> > > +       RTE_SET_USED(data_sz);
> > > +}
> > 
> > Are you sure it cannot be partially supported with WFE instruction?
> > 
> Armv8 WFE instruction can support monitoring of specific address for changes, 
> but not monitoring of TSC timestamp. 

So it is a partial support.

We must try hard to unify architecture support
to avoid #ifdef everywhere.

I don't agree with how new instructions have been managed recently.
Please look further.




* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-20  7:35                 ` Thomas Monjalon
@ 2020-10-20 14:01                   ` David Hunt
  2020-10-20 14:17                     ` David Hunt
  0 siblings, 1 reply; 421+ messages in thread
From: David Hunt @ 2020-10-20 14:01 UTC (permalink / raw)
  To: Thomas Monjalon, Honnappa Nagarahalli, Ruifeng Wang
  Cc: dev, Liang Ma, Jan Viktorin, David Christensen, Bruce Richardson,
	Konstantin Ananyev, jerinjacobk, timothy.mcdaniel, gage.eads,
	chris.macnamara, Anatoly Burakov, david.marchand, nd


On 20/10/2020 8:35 AM, Thomas Monjalon wrote:
> 20/10/2020 04:49, Ruifeng Wang:
>> From: Thomas Monjalon <thomas@monjalon.net>
>>> 15/10/2020 14:04, Anatoly Burakov:
>>>> +/**
>>>> + * This function is not supported on ARM.
>>>> + */
>>>> +static inline void
>>>> +rte_power_monitor(const volatile void *p, const uint64_t
>>> expected_value,
>>>> +               const uint64_t value_mask, const uint64_t tsc_timestamp,
>>>> +               const uint8_t data_sz) {
>>>> +       RTE_SET_USED(p);
>>>> +       RTE_SET_USED(expected_value);
>>>> +       RTE_SET_USED(value_mask);
>>>> +       RTE_SET_USED(tsc_timestamp);
>>>> +       RTE_SET_USED(data_sz);
>>>> +}
>>> Are you sure it cannot be partially supported with WFE instruction?
>>>
>> Armv8 WFE instruction can support monitoring of specific address for changes,
>> but not monitoring of TSC timestamp.
> So it is a partial support.
>
> We must try hard to unify architectures support
> to avoid #ifdef everywhere.
>
> I don't agree with how are managed new instructions recently.
> Please look further.
>

Hi Thomas,

We believe this is ready for -rc1. Can we discuss this with the
technical board before the RC1 tag is applied?

Regards,
Dave.




* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-20 14:01                   ` David Hunt
@ 2020-10-20 14:17                     ` David Hunt
  2020-10-20 14:33                       ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: David Hunt @ 2020-10-20 14:17 UTC (permalink / raw)
  To: Thomas Monjalon, Honnappa Nagarahalli, Ruifeng Wang
  Cc: dev, Liang Ma, Jan Viktorin, David Christensen, Bruce Richardson,
	Konstantin Ananyev, jerinjacobk, timothy.mcdaniel, gage.eads,
	chris.macnamara, Anatoly Burakov, david.marchand, nd


On 20/10/2020 3:01 PM, David Hunt wrote:
>
> On 20/10/2020 8:35 AM, Thomas Monjalon wrote:
>> 20/10/2020 04:49, Ruifeng Wang:
>>> From: Thomas Monjalon <thomas@monjalon.net>
>>>> 15/10/2020 14:04, Anatoly Burakov:
>>>>> +/**
>>>>> + * This function is not supported on ARM.
>>>>> + */
>>>>> +static inline void
>>>>> +rte_power_monitor(const volatile void *p, const uint64_t
>>>> expected_value,
>>>>> +               const uint64_t value_mask, const uint64_t 
>>>>> tsc_timestamp,
>>>>> +               const uint8_t data_sz) {
>>>>> +       RTE_SET_USED(p);
>>>>> +       RTE_SET_USED(expected_value);
>>>>> +       RTE_SET_USED(value_mask);
>>>>> +       RTE_SET_USED(tsc_timestamp);
>>>>> +       RTE_SET_USED(data_sz);
>>>>> +}
>>>> Are you sure it cannot be partially supported with WFE instruction?
>>>>
>>> Armv8 WFE instruction can support monitoring of specific address for 
>>> changes,
>>> but not monitoring of TSC timestamp.
>> So it is a partial support.
>>
>> We must try hard to unify architectures support
>> to avoid #ifdef everywhere.
>>
>> I don't agree with how are managed new instructions recently.
>> Please look further.
>>
>
> Hi Thomas,
>
> We believe this is ready for -rc1, can we discuss this with the 
> technical board before the RC1 tag is applied?
>

Hi Thomas,
     By way of further follow-up, here are the reasons why we believe 
it's ready for merge.

There are 18 Acks for the 10 patches, with the two critical patches 
getting 4 acks each.
These acks are from ARM, Marvell, IBM and Intel.
There have been 7 revisions, with quite a lot of discussion, and all 
comments have been addressed and Ack'd.
From what I can see, the community are in agreement that this patch
should be merged.

Rgds,
Dave.





* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-20 14:17                     ` David Hunt
@ 2020-10-20 14:33                       ` Thomas Monjalon
  2020-10-20 17:26                         ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-20 14:33 UTC (permalink / raw)
  To: David Hunt
  Cc: Honnappa Nagarahalli, Ruifeng Wang, dev, Liang Ma, Jan Viktorin,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	jerinjacobk, timothy.mcdaniel, gage.eads, chris.macnamara,
	Anatoly Burakov, david.marchand, nd

20/10/2020 16:17, David Hunt:
> On 20/10/2020 3:01 PM, David Hunt wrote:
> > On 20/10/2020 8:35 AM, Thomas Monjalon wrote:
> >> 20/10/2020 04:49, Ruifeng Wang:
> >>> From: Thomas Monjalon <thomas@monjalon.net>
> >>>> 15/10/2020 14:04, Anatoly Burakov:
> >>>>> +/**
> >>>>> + * This function is not supported on ARM.
> >>>>> + */
> >>>>> +static inline void
> >>>>> +rte_power_monitor(const volatile void *p, const uint64_t
> >>>> expected_value,
> >>>>> +               const uint64_t value_mask, const uint64_t 
> >>>>> tsc_timestamp,
> >>>>> +               const uint8_t data_sz) {
> >>>>> +       RTE_SET_USED(p);
> >>>>> +       RTE_SET_USED(expected_value);
> >>>>> +       RTE_SET_USED(value_mask);
> >>>>> +       RTE_SET_USED(tsc_timestamp);
> >>>>> +       RTE_SET_USED(data_sz);
> >>>>> +}
> >>>> Are you sure it cannot be partially supported with WFE instruction?
> >>>>
> >>> Armv8 WFE instruction can support monitoring of specific address for 
> >>> changes,
> >>> but not monitoring of TSC timestamp.
> >> So it is a partial support.
> >>
> >> We must try hard to unify architectures support
> >> to avoid #ifdef everywhere.
> >>
> >> I don't agree with how are managed new instructions recently.
> >> Please look further.
> >>
> >
> > Hi Thomas,
> >
> > We believe this is ready for -rc1, can we discuss this with the 
> > technical board before the RC1 tag is applied?
> >
> 
> Hi Thomas,
>      By way of further follow-up, here are the reasons why we believe 
> it's ready for merge.
> 
> There are 18 Acks for the 10 patches, with the two critical patches 
> getting 4 acks each.
> These acks are from ARM, Marvell, IBM and Intel.
> There have been 7 revisions, with quite a lot of discussion, and all 
> comments have been addressed and Ack'd.
>  From what I can see, the community are in agreement that this patch 
> should be merged.

The problem is that I don't agree,
and I feel you tried to avoid comments from others at the beginning.

Now I don't want to spend more time on it before tagging -rc1.

Next time, you'll make sure to Cc and reply to everybody.




* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-20 14:33                       ` Thomas Monjalon
@ 2020-10-20 17:26                         ` Ananyev, Konstantin
  2020-10-20 19:28                           ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-20 17:26 UTC (permalink / raw)
  To: Thomas Monjalon, Hunt, David
  Cc: Honnappa Nagarahalli, Ruifeng Wang, dev, Ma, Liang J,
	Jan Viktorin, David Christensen, Richardson, Bruce, jerinjacobk,
	McDaniel, Timothy, Eads, Gage, Macnamara, Chris, Burakov,
	Anatoly, david.marchand, nd


 
> 20/10/2020 16:17, David Hunt:
> > On 20/10/2020 3:01 PM, David Hunt wrote:
> > > On 20/10/2020 8:35 AM, Thomas Monjalon wrote:
> > >> 20/10/2020 04:49, Ruifeng Wang:
> > >>> From: Thomas Monjalon <thomas@monjalon.net>
> > >>>> 15/10/2020 14:04, Anatoly Burakov:
> > >>>>> +/**
> > >>>>> + * This function is not supported on ARM.
> > >>>>> + */
> > >>>>> +static inline void
> > >>>>> +rte_power_monitor(const volatile void *p, const uint64_t
> > >>>> expected_value,
> > >>>>> +               const uint64_t value_mask, const uint64_t
> > >>>>> tsc_timestamp,
> > >>>>> +               const uint8_t data_sz) {
> > >>>>> +       RTE_SET_USED(p);
> > >>>>> +       RTE_SET_USED(expected_value);
> > >>>>> +       RTE_SET_USED(value_mask);
> > >>>>> +       RTE_SET_USED(tsc_timestamp);
> > >>>>> +       RTE_SET_USED(data_sz);
> > >>>>> +}
> > >>>> Are you sure it cannot be partially supported with WFE instruction?
> > >>>>
> > >>> Armv8 WFE instruction can support monitoring of specific address for
> > >>> changes,
> > >>> but not monitoring of TSC timestamp.
> > >> So it is a partial support.
> > >>
> > >> We must try hard to unify architectures support
> > >> to avoid #ifdef everywhere.
> > >>
> > >> I don't agree with how are managed new instructions recently.
> > >> Please look further.
> > >>
> > >
> > > Hi Thomas,
> > >
> > > We believe this is ready for -rc1, can we discuss this with the
> > > technical board before the RC1 tag is applied?
> > >
> >
> > Hi Thomas,
> >      By way of further follow-up, here are the reasons why we believe
> > it's ready for merge.
> >
> > There are 18 Acks for the 10 patches, with the two critical patches
> > getting 4 acks each.
> > These acks are from ARM, Marvell, IBM and Intel.
> > There have been 7 revisions, with quite a lot of discussion, and all
> > comments have been addressed and Ack'd.
> >  From what I can see, the community are in agreement that this patch
> > should be merged.
> 
> The problem is that I don't agree,

Thomas, could you explain what exactly you don't agree with?
Is it about WFE? Something else?
Konstantin

> and I feel you tried to avoid comments from others at the beginning.
> 
> Now I don't want to spend more time on it before tagging -rc1.
> 
> Next time, you'll make sure to Cc and reply everybody.
> 



* Re: [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics
  2020-10-20 17:26                         ` Ananyev, Konstantin
@ 2020-10-20 19:28                           ` Thomas Monjalon
  0 siblings, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-20 19:28 UTC (permalink / raw)
  To: Hunt, David, Ananyev, Konstantin
  Cc: Honnappa Nagarahalli, Ruifeng Wang, dev, Ma, Liang J,
	Jan Viktorin, David Christensen, Richardson, Bruce, jerinjacobk,
	McDaniel, Timothy, Eads, Gage, Macnamara, Chris, Burakov,
	Anatoly, david.marchand, nd

20/10/2020 19:26, Ananyev, Konstantin:
> > 20/10/2020 16:17, David Hunt:
> > > On 20/10/2020 3:01 PM, David Hunt wrote:
> > > > On 20/10/2020 8:35 AM, Thomas Monjalon wrote:
> > > >> 20/10/2020 04:49, Ruifeng Wang:
> > > >>> From: Thomas Monjalon <thomas@monjalon.net>
> > > >>>> 15/10/2020 14:04, Anatoly Burakov:
> > > >>>>> +/**
> > > >>>>> + * This function is not supported on ARM.
> > > >>>>> + */
> > > >>>>> +static inline void
> > > >>>>> +rte_power_monitor(const volatile void *p, const uint64_t
> > > >>>> expected_value,
> > > >>>>> +               const uint64_t value_mask, const uint64_t
> > > >>>>> tsc_timestamp,
> > > >>>>> +               const uint8_t data_sz) {
> > > >>>>> +       RTE_SET_USED(p);
> > > >>>>> +       RTE_SET_USED(expected_value);
> > > >>>>> +       RTE_SET_USED(value_mask);
> > > >>>>> +       RTE_SET_USED(tsc_timestamp);
> > > >>>>> +       RTE_SET_USED(data_sz);
> > > >>>>> +}
> > > >>>> Are you sure it cannot be partially supported with WFE instruction?
> > > >>>>
> > > >>> Armv8 WFE instruction can support monitoring of specific address for
> > > >>> changes,
> > > >>> but not monitoring of TSC timestamp.
> > > >> So it is a partial support.
> > > >>
> > > >> We must try hard to unify architectures support
> > > >> to avoid #ifdef everywhere.
> > > >>
> > > >> I don't agree with how are managed new instructions recently.
> > > >> Please look further.
> > > >>
> > > >
> > > > Hi Thomas,
> > > >
> > > > We believe this is ready for -rc1, can we discuss this with the
> > > > technical board before the RC1 tag is applied?
> > > >
> > >
> > > Hi Thomas,
> > >      By way of further follow-up, here are the reasons why we believe
> > > it's ready for merge.
> > >
> > > There are 18 Acks for the 10 patches, with the two critical patches
> > > getting 4 acks each.
> > > These acks are from ARM, Marvell, IBM and Intel.
> > > There have been 7 revisions, with quite a lot of discussion, and all
> > > comments have been addressed and Ack'd.
> > >  From what I can see, the community are in agreement that this patch
> > > should be merged.
> > 
> > The problem is that I don't agree,
> 
> Thomas, could you explain about what exactly you don't agree with?
> Is it about WFE? Something else? 

It's about -rc1. I will look at this patchset for -rc2.

> > and I feel you tried to avoid comments from others at the beginning.
> > 
> > Now I don't want to spend more time on it before tagging -rc1.
> > 
> > Next time, you'll make sure to Cc and reply everybody.





* [dpdk-dev] [PATCH v8 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
@ 2020-10-23 17:17             ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
                                 ` (10 more replies)
  2020-10-23 17:20             ` [dpdk-dev] [PATCH v8 02/10] eal: add power management intrinsics Liang Ma
                               ` (8 subsequent siblings)
  9 siblings, 11 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:17 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	Liang Ma

Add a new CPUID flag indicating processor support for the
UMONITOR/UMWAIT and TPAUSE instructions.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_eal/x86/include/rte_cpuflags.h | 1 +
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..848ba9cbfb 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1



* [dpdk-dev] [PATCH v8 02/10] eal: add power management intrinsics
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
@ 2020-10-23 17:20             ` Liang Ma
  2020-10-23 17:21             ` [dpdk-dev] [PATCH v8 03/10] eal: add intrinsics support check infrastructure Liang Ma
                               ` (7 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:20 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, ruifeng.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

For more details, please refer to Intel(R) 64 and IA-32 Architectures
Software Developer's Manual, Volume 2.
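
Editor's note: the optional comparison described above can be sketched in
plain C. This is only an illustration of the documented semantics (mask the
current value, compare it with the expected value, skip the sleep on a
match). It is not the x86 implementation from this patch, and the names
`read_sized()` and `should_enter_sleep()` are hypothetical.

```c
#include <stdint.h>

/*
 * Read data_sz bytes at p as an unsigned integer. Callers must pass
 * 1, 2, 4 or 8; any other size is read as 8 bytes in this sketch,
 * whereas the patch leaves that case undefined.
 */
static uint64_t
read_sized(const volatile void *p, uint8_t data_sz)
{
	switch (data_sz) {
	case 1: return *(const volatile uint8_t *)p;
	case 2: return *(const volatile uint16_t *)p;
	case 4: return *(const volatile uint32_t *)p;
	default: return *(const volatile uint64_t *)p;
	}
}

/*
 * Return 1 if the caller should go on to sleep, 0 if the sleep should be
 * skipped because the masked value already matches expected_value.
 * A zero mask disables the comparison entirely.
 */
static int
should_enter_sleep(const volatile void *p, uint64_t expected_value,
		uint64_t value_mask, uint8_t data_sz)
{
	if (value_mask == 0)
		return 1;
	return (read_sized(p, data_sz) & value_mask) != expected_value;
}
```

With a 32-bit descriptor word whose low bit means "done", a caller would
pass mask 0x1, expected value 0x1 and data size 4, so that an
already-completed descriptor never puts the core to sleep.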

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  60 ++++++++
 .../include/generic/rte_power_intrinsics.h    | 111 ++++++++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  60 ++++++++
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 135 ++++++++++++++++++
 8 files changed, 370 insertions(+)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
index 73b750a18f..c6a9f70d73 100644
--- a/lib/librte_eal/arm/include/meson.build
+++ b/lib/librte_eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
 	'rte_pause_32.h',
 	'rte_pause_64.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch_32.h',
 	'rte_prefetch_64.h',
 	'rte_prefetch.h',
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a4a1bc1159
--- /dev/null
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_ARM_H_
+#define _RTE_POWER_INTRINSIC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..fb897d9060
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ */
+__rte_experimental
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This call will also unlock a spinlock on entering sleep, and lock it again
+ * on waking up the CPU.
+ *
+ * @param p
+ *   Address to monitor for changes.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ * @param lck
+ *   A spinlock that must be locked before entering the function, will be
+ *   unlocked while the CPU is sleeping, and will be locked again once the CPU
+ *   wakes up.
+ */
+__rte_experimental
+static inline void rte_power_monitor_sync(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz,
+		rte_spinlock_t *lck);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ */
+__rte_experimental
+static inline void rte_power_pause(const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
index ab4bd28092..0873b2aecb 100644
--- a/lib/librte_eal/ppc/include/meson.build
+++ b/lib/librte_eal/ppc/include/meson.build
@@ -10,6 +10,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch.h',
 	'rte_rwlock.h',
 	'rte_spinlock.h',
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..4ed03d521f
--- /dev/null
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_PPC_H_
+#define _RTE_POWER_INTRINSIC_PPC_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..f9b761d796
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_H_
+#define _RTE_POWER_INTRINSIC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses the TPAUSE instruction and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_H_ */
-- 
2.17.1



* [dpdk-dev] [PATCH v8 03/10] eal: add intrinsics support check infrastructure
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
  2020-10-23 17:20             ` [dpdk-dev] [PATCH v8 02/10] eal: add power management intrinsics Liang Ma
@ 2020-10-23 17:21             ` Liang Ma
  2020-10-23 17:22             ` [dpdk-dev] [PATCH v8 04/10] ethdev: add simple power management API Liang Ma
                               ` (6 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:21 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, ruifeng.wang, drc, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, mdr, nhorman, Liang Ma

Currently, it is not possible to check support for intrinsics that
are platform-specific, cannot be abstracted in a generic way, or do not
have support on all architectures. The CPUID flags can be used to some
extent, but they are only defined for their platform, while intrinsics
will be available to all code as they are in generic headers.

This patch introduces infrastructure to check support for certain
platform-specific intrinsics, and adds checks for the IA power
management intrinsics based on UMWAIT/UMONITOR and TPAUSE.
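
The intended calling pattern can be sketched as follows. This is a
minimal, self-contained illustration rather than code linked against
DPDK: `struct cpu_intrinsics` and `get_intrinsics_support()` are
hypothetical stand-ins that mirror `struct rte_cpu_intrinsics` and
`rte_cpu_get_intrinsics_support()` from this patch, the `has_waitpkg`
argument stands in for the WAITPKG CPU flag query, and
`choose_wait_method()` is an assumed helper showing how a caller might
fall back when the intrinsic is unavailable.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rte_cpu_intrinsics. */
struct cpu_intrinsics {
	uint32_t power_monitor : 1; /* monitor-style wait is usable */
	uint32_t power_pause : 1;   /* timed pause is usable */
};

/*
 * Stand-in for rte_cpu_get_intrinsics_support(). The has_waitpkg
 * argument models the result of the CPU flag query (an assumption
 * made so the sketch does not depend on the host CPU).
 */
static void
get_intrinsics_support(struct cpu_intrinsics *intr, int has_waitpkg)
{
	memset(intr, 0, sizeof(*intr));
	if (has_waitpkg) {
		intr->power_monitor = 1;
		intr->power_pause = 1;
	}
}

enum wait_method { WAIT_POWER_PAUSE, WAIT_BUSY_POLL };

/* Only select the intrinsic if the runtime check reports support. */
static enum wait_method
choose_wait_method(int has_waitpkg)
{
	struct cpu_intrinsics intr;

	get_intrinsics_support(&intr, has_waitpkg);
	return intr.power_pause ? WAIT_POWER_PAUSE : WAIT_BUSY_POLL;
}
```

Doing the check once at setup time, rather than before every call,
keeps the fast path free of the query.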

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Ray Kinsella <mdr@ashroe.eu>
---

Notes:
    v6:
    - Fix the comments
    v8:
    - Rename eal version.map
---
 lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
 .../include/generic/rte_power_intrinsics.h    | 12 +++++++++
 lib/librte_eal/ppc/rte_cpuflags.c             |  7 +++++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
 6 files changed, 64 insertions(+)

diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
index 7b257b7873..e3a53bcece 100644
--- a/lib/librte_eal/arm/rte_cpuflags.c
+++ b/lib/librte_eal/arm/rte_cpuflags.c
@@ -151,3 +151,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
index 872f0ebe3e..28a5aecde8 100644
--- a/lib/librte_eal/include/generic/rte_cpuflags.h
+++ b/lib/librte_eal/include/generic/rte_cpuflags.h
@@ -13,6 +13,32 @@
 #include "rte_common.h"
 #include <errno.h>
 
+#include <rte_compat.h>
+
+/**
+ * Structure used to describe platform-specific intrinsics that may or may not
+ * be supported at runtime.
+ */
+struct rte_cpu_intrinsics {
+	uint32_t power_monitor : 1;
+	/**< indicates support for rte_power_monitor function */
+	uint32_t power_pause : 1;
+	/**< indicates support for rte_power_pause function */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Check CPU support for various intrinsics at runtime.
+ *
+ * @param intrinsics
+ *     Pointer to a structure to be filled.
+ */
+__rte_experimental
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
+
 /**
  * Enumeration of all CPU features supported
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index fb897d9060..03a326f076 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -32,6 +32,10 @@
  * checked against the expected value, and if they match, the entering of
  * optimized power state may be aborted.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using the `rte_cpu_get_intrinsics_support()` API
+ *   call. Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes.
  * @param expected_value
@@ -69,6 +73,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * This call will also unlock a spinlock on entering sleep, and lock it again
  * on waking up the CPU.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using the `rte_cpu_get_intrinsics_support()` API
+ *   call. Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes.
  * @param expected_value
@@ -101,6 +109,10 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using the `rte_cpu_get_intrinsics_support()` API
+ *   call. Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
index 3bb7563ce9..61db5c216d 100644
--- a/lib/librte_eal/ppc/rte_cpuflags.c
+++ b/lib/librte_eal/ppc/rte_cpuflags.c
@@ -8,6 +8,7 @@
 #include <elf.h>
 #include <fcntl.h>
 #include <assert.h>
+#include <string.h>
 #include <unistd.h>
 
 /* Symbolic values for the entries in the auxiliary table */
@@ -108,3 +109,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index c23ff57ce6..269cdccfd3 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -402,6 +402,7 @@ EXPERIMENTAL {
 	rte_service_lcore_may_be_active;
 	rte_vect_get_max_simd_bitwidth;
 	rte_vect_set_max_simd_bitwidth;
+	rte_cpu_get_intrinsics_support;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 0325c4b93b..a96312ff7f 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -7,6 +7,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <stdint.h>
+#include <string.h>
 
 #include "rte_cpuid.h"
 
@@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+		intrinsics->power_monitor = 1;
+		intrinsics->power_pause = 1;
+	}
+}
-- 
2.17.1



* [dpdk-dev] [PATCH v8 04/10] ethdev: add simple power management API
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
                               ` (2 preceding siblings ...)
  2020-10-23 17:21             ` [dpdk-dev] [PATCH v8 03/10] eal: add intrinsics support check infrastructure Liang Ma
@ 2020-10-23 17:22             ` Liang Ma
  2020-10-23 17:23             ` [dpdk-dev] [PATCH v8 05/10] power: add PMD power management API and callback Liang Ma
                               ` (5 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:22 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, ferruh.yigit, andrew.rybchenko, nhorman,
	bruce.richardson, konstantin.ananyev, david.hunt, jerinjacobk,
	thomas, timothy.mcdaniel, gage.eads, mdr, Liang Ma

Add a simple API to allow getting the address of the next RX descriptor
from the PMD, as well as release notes information.
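
How the values returned by the new API are meant to be consumed can be
sketched as follows. This is a self-contained illustration, not the
driver or EAL code: `read_sized()` and `wake_condition_met()` are
hypothetical helpers that mirror the masked comparison performed before
entering the power-optimized state, using the same `wake_addr`,
`expected`, `mask` and `data_sz` quantities that
`rte_eth_get_wake_addr()` fills in.

```c
#include <stdint.h>

/* Read data_sz bytes from p as an unsigned value. */
static uint64_t
read_sized(const volatile void *p, uint8_t data_sz)
{
	switch (data_sz) {
	case 1:
		return *(const volatile uint8_t *)p;
	case 2:
		return *(const volatile uint16_t *)p;
	case 4:
		return *(const volatile uint32_t *)p;
	case 8:
		return *(const volatile uint64_t *)p;
	default:
		/* unsupported size: treated as zero in this sketch */
		return 0;
	}
}

/*
 * Return 1 if the monitored location already holds the expected value
 * under the mask, meaning the sleep should be skipped. A zero mask
 * disables the comparison, so the function returns 0.
 */
static int
wake_condition_met(const volatile void *wake_addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	if (mask == 0)
		return 0;
	return (read_sized(wake_addr, data_sz) & mask) == expected;
}
```

For example, with a 4-byte descriptor status word whose lowest bit
signals completion, a caller would pass `expected = 1`, `mask = 1` and
`data_sz = 4`.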

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v8:
    - Rename version map file name.

    v7:
    - Fixed queue ID validation
    - Fixed documentation

    v6:
    - Rebase on top of latest main
    - Ensure the API checks queue ID (Konstantin)
    - Removed accidental inclusion of unrelated release notes
    v5:
    - Bring function format in line with other functions in the file
    - Ensure the API is supported by the driver before calling it (Konstantin)
---
 doc/guides/rel_notes/release_20_11.rst |  5 +++++
 lib/librte_ethdev/rte_ethdev.c         | 23 +++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/version.map          |  1 +
 5 files changed, 85 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index d8ac359e51..2827a000db 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -139,6 +139,11 @@ New Features
   Hairpin Tx part flow rules can be inserted explicitly.
   New API is added to get the hairpin peer ports list.
 
+* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()``
+  * add new eth_dev_ops ``get_wake_addr``
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index b12bb3854d..4f3115fe8e 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5138,6 +5138,29 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask, data_sz));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index e341a08817..11559e7bc8 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4364,6 +4364,34 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake up address for the receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information will be
+ *   retrieved.
+ * @param wake_addr
+ *   Pointer to where the address to be monitored will be stored.
+ * @param expected
+ *   Pointer to the value to be expected when the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @param data_sz
+ *   Pointer to the data size for the expected value and comparison bitmask.
+ *
+ * @return
+ *   - (0): Success.
+ *   - (-ENOTSUP): Operation not supported.
+ *   - (-EINVAL): Invalid parameters.
+ *   - (-ENODEV): Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c63b9f7eb7..d7548dfe74 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -752,6 +752,32 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an RX queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to where the address will be stored.
+ * @param expected
+ *   Pointer to the value to be expected when the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @param data_sz
+ *   Pointer to the data size of the expected value (1, 2, 4, or 8 bytes).
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -910,6 +936,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get next RX queue ring entry address. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index 8ddda2547f..392c273712 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -244,6 +244,7 @@ EXPERIMENTAL {
 	rte_flow_get_restore_info;
 	rte_flow_tunnel_action_decap_release;
 	rte_flow_tunnel_item_release;
+	rte_eth_get_wake_addr;
 };
 
 INTERNAL {
-- 
2.17.1



* [dpdk-dev] [PATCH v8 05/10] power: add PMD power management API and callback
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
                               ` (3 preceding siblings ...)
  2020-10-23 17:22             ` [dpdk-dev] [PATCH v8 04/10] ethdev: add simple power management API Liang Ma
@ 2020-10-23 17:23             ` Liang Ma
  2020-10-23 17:26             ` [dpdk-dev] [PATCH v8 06/10] net/ixgbe: implement power management API Liang Ma
                               ` (4 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:23 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, nhorman, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	mdr, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting for the address of the next
   RX descriptor to be written to.

2. Pause instruction

   Instead of moving the core into a deeper C state, this method uses the
   pause instruction to avoid busy polling.

3. Frequency scaling

   Reuse the existing DPDK power library to scale core frequency up or
   down depending on traffic volume.
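
The empty-poll counting common to the schemes above can be sketched as
follows. This is a self-contained illustration under stated
assumptions, not the callback code from this patch: `struct
queue_state` and `on_rx_burst()` are hypothetical names, and the
threshold of 512 is taken from `EMPTYPOLL_MAX` in the patch. The
function only reports whether a power-saving action would be taken; the
action itself (UMWAIT, pause, or frequency scaling) is left out.

```c
#include <stdint.h>

#define EMPTYPOLL_MAX 512

/* Hypothetical per-queue state: consecutive empty polls seen so far. */
struct queue_state {
	uint64_t empty_polls;
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns 1 when the empty-poll count has exceeded the threshold and a
 * power-saving action would be taken, 0 otherwise. Any non-empty burst
 * resets the counter.
 */
static int
on_rx_burst(struct queue_state *q, uint16_t nb_rx)
{
	if (nb_rx == 0) {
		q->empty_polls++;
		if (q->empty_polls > EMPTYPOLL_MAX)
			return 1;
	} else {
		q->empty_polls = 0;
	}
	return 0;
}
```

Because the counter is reset on any received packet, the power-saving
path is only taken after a run of more than `EMPTYPOLL_MAX` consecutive
empty polls.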

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v8:
    - Rename version map file name

    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup

    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check
---
 doc/guides/rel_notes/release_20_11.rst |  11 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 320 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 +++++++
 lib/librte_power/version.map           |   4 +
 5 files changed, 430 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 2827a000db..5f32a5da1d 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -350,6 +350,17 @@ New Features
   * Replaced ``--scalar`` command-line option with ``--alg=<value>``, to allow
     the user to select the desired classify method.
 
+* **Add PMD power management mechanism**
+
+  Three new Ethernet PMD power management mechanisms are added through the
+  existing RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..0dcaddc3bd
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,320 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static inline void
+umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static inline void
+umwait_sleep(struct pmd_queue_cfg *q_conf, uint16_t port_id, uint16_t qidx)
+{
+	volatile void *target_addr;
+	uint64_t expected, mask;
+	uint8_t data_sz;
+	int ret;
+
+	/*
+	 * get wake up address for this RX queue, as well as expected value,
+	 * comparison mask, and data size.
+	 */
+	ret = rte_eth_get_wake_addr(port_id, qidx, &target_addr,
+			&expected, &mask, &data_sz);
+
+	/* this should always succeed as all checks have been done already */
+	if (unlikely(ret != 0))
+		return;
+
+	/*
+	 * take out a spinlock to prevent control plane from concurrently
+	 * modifying the wakeup data.
+	 */
+	rte_spinlock_lock(&q_conf->umwait_lock);
+
+	/* have we been disabled by control plane? */
+	if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		/* we're good to go */
+
+		/*
+		 * store the wakeup address so that control plane can trigger a
+		 * write to this address and wake us up.
+		 */
+		q_conf->wait_addr = target_addr;
+		/* -1ULL is maximum value for TSC */
+		rte_power_monitor_sync(target_addr, expected, mask, -1ULL,
+				data_sz, &q_conf->umwait_lock);
+		/* erase the address */
+		q_conf->wait_addr = NULL;
+	}
+	rte_spinlock_unlock(&q_conf->umwait_lock);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			umwait_sleep(q_conf, port_id, qidx);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+		/*
+		 * we need to disable early as there might be callback currently
+		 * spinning on a lock.
+		 */
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1



* [dpdk-dev] [PATCH v8 06/10] net/ixgbe: implement power management API
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
                               ` (4 preceding siblings ...)
  2020-10-23 17:23             ` [dpdk-dev] [PATCH v8 05/10] power: add PMD power management API and callback Liang Ma
@ 2020-10-23 17:26             ` Liang Ma
  2020-10-23 17:26             ` [dpdk-dev] [PATCH v8 07/10] net/i40e: " Liang Ma
                               ` (3 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:26 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, jia.guo, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, Liang Ma

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
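
For reviewers, a standalone sketch of how the three values
returned by `get_wake_addr` are meant to be consumed. It assumes
the monitor helper skips the sleep when
`(value & mask) == expected`; that helper is not part of this
patch, so treat the check as an assumption. All `sketch_*` names
and SKETCH_DD_BIT are hypothetical, and byte-order conversion is
left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "descriptor done" bit, standing in for the DD bit. */
#define SKETCH_DD_BIT 0x1ULL

/* Read data_sz bytes (1, 2, 4 or 8) from addr as a 64-bit value. */
static uint64_t
sketch_read(const volatile void *addr, uint8_t data_sz)
{
	switch (data_sz) {
	case 1:
		return *(const volatile uint8_t *)addr;
	case 2:
		return *(const volatile uint16_t *)addr;
	case 4:
		return *(const volatile uint32_t *)addr;
	case 8:
		return *(const volatile uint64_t *)addr;
	default:
		assert(0 && "unsupported data size");
		return 0;
	}
}

/*
 * Return true if the monitored location already holds the expected
 * value under the mask, i.e. the descriptor was written back and
 * there is no point in going to sleep.
 */
static bool
sketch_already_written(const volatile void *addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	return (sketch_read(addr, data_sz) & mask) == expected;
}
```

Passing the access width separately is what lets one helper serve
status fields of different sizes, such as the 32-bit field
monitored here.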
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 00101c2eec..fcc4026372 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d1d3baff90..096dff37ba 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1367,6 +1367,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	*data_sz = 4;
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..1ef0b05e66 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1



* [dpdk-dev] [PATCH v8 07/10] net/i40e: implement power management API
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
                               ` (5 preceding siblings ...)
  2020-10-23 17:26             ` [dpdk-dev] [PATCH v8 06/10] net/ixgbe: implement power management API Liang Ma
@ 2020-10-23 17:26             ` Liang Ma
  2020-10-23 17:27             ` [dpdk-dev] [PATCH v8 08/10] net/ice: " Liang Ma
                               ` (2 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:26 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, jia.guo, beilei.xing, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, Liang Ma

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 4778aaf299..358a38232b 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr	              = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..78862fe3a2 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,32 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	*data_sz = 8;
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..5826cf1099 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1



* [dpdk-dev] [PATCH v8 08/10] net/ice: implement power management API
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
                               ` (6 preceding siblings ...)
  2020-10-23 17:26             ` [dpdk-dev] [PATCH v8 07/10] net/i40e: " Liang Ma
@ 2020-10-23 17:27             ` Liang Ma
  2020-10-23 17:30             ` [dpdk-dev] [PATCH v8 09/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
  2020-10-23 17:36             ` [dpdk-dev] [PATCH v8 10/10] doc: update programmer's guide for power library Liang Ma
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:27 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, qiming.yang, qi.z.zhang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, Liang Ma

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index c65125ff32..54f185ad4d 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr	              = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index ee576c362a..fafd6ada62 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	*data_sz = 2;
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..7eeb8d467e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1



* [dpdk-dev] [PATCH v8 09/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
                               ` (7 preceding siblings ...)
  2020-10-23 17:27             ` [dpdk-dev] [PATCH v8 08/10] net/ice: " Liang Ma
@ 2020-10-23 17:30             ` Liang Ma
  2020-10-23 17:36             ` [dpdk-dev] [PATCH v8 10/10] doc: update programmer's guide for power library Liang Ma
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:30 UTC (permalink / raw)
  To: dev
  Cc: konstantin.ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, Liang Ma, Anatoly Burakov

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v8:
    - Add return status check for queue enable

    v6:
    - Fixed typos in documentation
---
 .../sample_app_ug/l3_forward_power_man.rst    | 13 ++++++
 examples/l3fwd-power/main.c                   | 46 ++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index d7e1dc5813..b10ebe662e 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -455,3 +457,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD Power Management Mode
+-------------------------
+
+PMD power management mode is a standalone ``l3fwd-power`` mode. In this mode,
+``l3fwd-power`` performs simple L3 forwarding and enables the power saving scheme on a
+specific port/queue/lcore. Its main purpose is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index a48d75f68f..aafa415f0b 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf("PMD power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,17 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(lcore_id,
+						     portid, queueid,
+						     RTE_POWER_MGMT_TYPE_SCALE);
+
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+					"rte_power_pmd_mgmt enable: err=%d, "
+					"port=%d\n", ret, portid);
+
+			}
 		}
 	}
 
@@ -2790,6 +2817,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2816,6 +2846,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1



* [dpdk-dev] [PATCH v8 10/10] doc: update programmer's guide for power library
  2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
                               ` (8 preceding siblings ...)
  2020-10-23 17:30             ` [dpdk-dev] [PATCH v8 09/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
@ 2020-10-23 17:36             ` Liang Ma
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 17:36 UTC (permalink / raw)
  To: dev
  Cc: konstantin.ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, Liang Ma, Anatoly Burakov

Update programmer's guide to document PMD power management usage.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..38c64d31e4 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,45 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks and triggering
+power saving whenever the empty poll count exceeds a certain threshold.
+
+  * UMWAIT/UMONITOR
+
+   This power saving scheme puts the CPU into an optimized power state and uses
+   the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX descriptor
+   address, waking the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will use the `rte_pause` function to avoid busy
+   polling.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing power library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable a specific power scheme for a given queue/port/core.
+
+* **Queue Disable**: Disable the power scheme for a given queue/port/core.
+
 References
 ----------
 
@@ -200,3 +239,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
-- 
2.17.1



* [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 0/9] " Liang Ma
                                   ` (9 more replies)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                                 ` (9 subsequent siblings)
  10 siblings, 10 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

This patchset proposes a simple API for Ethernet drivers
to cause the CPU to enter a power-optimized state while
waiting for packets to arrive, along with a set of
generic intrinsics that facilitate that. This is achieved
through cooperation with the NIC driver, which lets us
learn the address of the wake-up event and wait for
writes to it.

On IA, this is achieved through using UMONITOR/UMWAIT
instructions. They are used in their raw opcode form
because there is no widespread compiler support for
them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen
to implement similar instructions.

To achieve power savings, there is a very simple mechanism
used: we're counting empty polls, and if a certain threshold
is reached, we get the address of next RX ring descriptor
from the NIC driver, arm the monitoring hardware, and
enter a power-optimized state. We will then wake up when
either a timeout happens, or a write happens (or generally
whenever CPU feels like waking up - this is platform-
specific), and proceed as normal. The empty poll counter is
reset whenever we actually get packets, so we only go to
sleep when we know nothing is going on. The mechanism is
generic and can be used with any write-back descriptor.

Why are we putting it into ethdev as opposed to leaving
this up to the application? Our customers specifically
requested a way to do it with minimal changes to the
application code. The current approach allows them to just
flip a switch and automatically have power savings.

- Only 1:1 core to queue mapping is supported,
  meaning that each lcore must handle RX on at most a
  single queue
- Three policies are supported: UMWAIT, PAUSE and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types
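
To make the per-queue enable/disable semantics concrete, here is
a standalone model of the bookkeeping. It is a sketch only, not
the library code: it mirrors the -EINVAL returned by the posted
rte_power_pmd_mgmt.c when a queue is enabled twice or disabled
without being enabled. All `sketch_*` names and table sizes are
hypothetical, and the bounds check on the disable path is an
addition made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical table sizes, kept small for the sketch. */
#define SKETCH_MAX_PORTS  4
#define SKETCH_MAX_QUEUES 8

enum sketch_state {
	SKETCH_DISABLED = 0,
	SKETCH_ENABLED,
};

/* One state slot per (port, queue) pair; zero-initialized. */
static enum sketch_state sketch_cfg[SKETCH_MAX_PORTS][SKETCH_MAX_QUEUES];

static int
sketch_queue_enable(uint16_t port, uint16_t queue)
{
	if (port >= SKETCH_MAX_PORTS || queue >= SKETCH_MAX_QUEUES)
		return -EINVAL;
	/* enabling an already enabled queue is an error */
	if (sketch_cfg[port][queue] == SKETCH_ENABLED)
		return -EINVAL;

	sketch_cfg[port][queue] = SKETCH_ENABLED;
	return 0;
}

static int
sketch_queue_disable(uint16_t port, uint16_t queue)
{
	if (port >= SKETCH_MAX_PORTS || queue >= SKETCH_MAX_QUEUES)
		return -EINVAL;
	/* disabling a queue that was never enabled is an error */
	if (sketch_cfg[port][queue] == SKETCH_DISABLED)
		return -EINVAL;

	sketch_cfg[port][queue] = SKETCH_DISABLED;
	return 0;
}
```

Each queue carries its own state, so enabling power management
on one queue leaves every other queue of the same port untouched.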

Liang Ma (10):
  eal: add new x86 cpuid support for WAITPKG
  eal: add power management intrinsics
  eal: add intrinsics support check infrastructure
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt
  doc: update programmer's guide for power library

 doc/guides/prog_guide/power_man.rst           |  42 +++
 doc/guides/rel_notes/release_20_11.rst        |  16 +
 .../sample_app_ug/l3_forward_power_man.rst    |  13 +
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  26 ++
 drivers/net/i40e/i40e_rxtx.h                  |   2 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
 examples/l3fwd-power/main.c                   |  46 ++-
 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  60 ++++
 lib/librte_eal/arm/rte_cpuflags.c             |   6 +
 lib/librte_eal/include/generic/rte_cpuflags.h |  26 ++
 .../include/generic/rte_power_intrinsics.h    | 123 +++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  60 ++++
 lib/librte_eal/ppc/rte_cpuflags.c             |   7 +
 lib/librte_eal/version.map                    |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 135 ++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |  14 +
 lib/librte_ethdev/rte_ethdev.c                |  23 ++
 lib/librte_ethdev/rte_ethdev.h                |  28 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  28 ++
 lib/librte_ethdev/version.map                 |   1 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 320 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  92 +++++
 lib/librte_power/version.map                  |   4 +
 35 files changed, 1138 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.17.1



* [dpdk-dev] [PATCH v9 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 02/10] eal: add power management intrinsics Liang Ma
                                 ` (8 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Add a new CPUID flag indicating processor support for the UMONITOR/UMWAIT
and TPAUSE instructions.
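
For readers who want to see what the new flag tests, the sketch below
decodes the same bit that the FEAT_DEF entry in this patch describes
(CPUID leaf 7, sub-leaf 0, ECX bit 5). It is a standalone illustration,
not DPDK code: the helper names are hypothetical, and it assumes a GCC
or Clang toolchain that provides __get_cpuid_count() in <cpuid.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* WAITPKG is reported in CPUID leaf 7, sub-leaf 0, ECX bit 5. */
#define WAITPKG_ECX_BIT 5u

/* Test a single bit of a 32-bit CPUID output register. */
static bool
reg_bit_set(uint32_t reg, unsigned int bit)
{
	assert(bit < 32); /* shifting a 32-bit value by >= 32 is undefined */
	return ((reg >> bit) & 1u) != 0;
}

/* Query the running CPU; always false on non-x86 builds. */
static bool
cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	/* __get_cpuid_count() returns 0 if leaf 7 is not available */
	if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) == 0)
		return false;
	return reg_bit_set(ecx, WAITPKG_ECX_BIT);
#else
	return false;
#endif
}
```

Inside DPDK itself the equivalent check is
rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG), using the flag added here.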

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_eal/x86/include/rte_cpuflags.h | 1 +
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..848ba9cbfb 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1



* [dpdk-dev] [PATCH v9 02/10] eal: add power management intrinsics
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 03/10] eal: add intrinsics support check infrastructure Liang Ma
                                 ` (7 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management intrinsics provide architecture-specific
functions to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.
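
The optional comparison and the timestamp handling can be modelled in
portable C without the UMONITOR/UMWAIT opcodes. The sketch below is
only an approximation of the logic described above: the helper names
(read_sized, wake_condition_met, split_tsc) are hypothetical, and the
real x86 implementation arms UMONITOR before doing the comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Read 1, 2, 4 or 8 bytes from p, widened to 64 bits. */
static uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		assert(0 && "unsupported data size");
		return 0;
	}
}

/*
 * Return true if the sleep should be skipped because the masked value
 * at p already equals the expected value. A zero mask disables the check.
 */
static bool
wake_condition_met(const volatile void *p, uint64_t expected,
		uint64_t mask, uint8_t sz)
{
	if (mask == 0)
		return false;
	return (read_sized(p, sz) & mask) == expected;
}

/* Split a 64-bit TSC deadline into the halves passed in EAX and EDX. */
static void
split_tsc(uint64_t tsc, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)tsc;
	*hi = (uint32_t)(tsc >> 32);
}
```

A caller in this model would skip the wait when wake_condition_met()
returns true, which corresponds to the early return in the monitor
function.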

For more details, please refer to Intel(R) 64 and IA-32 Architectures
Software Developer's Manual, Volume 2.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  60 ++++++++
 .../include/generic/rte_power_intrinsics.h    | 111 ++++++++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  60 ++++++++
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 135 ++++++++++++++++++
 8 files changed, 370 insertions(+)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
index 73b750a18f..c6a9f70d73 100644
--- a/lib/librte_eal/arm/include/meson.build
+++ b/lib/librte_eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
 	'rte_pause_32.h',
 	'rte_pause_64.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch_32.h',
 	'rte_prefetch_64.h',
 	'rte_prefetch.h',
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a4a1bc1159
--- /dev/null
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_ARM_H_
+#define _RTE_POWER_INTRINSIC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..fb897d9060
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ */
+__rte_experimental
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This call will also lock a spinlock on entering sleep, and release it on
+ * waking up the CPU.
+ *
+ * @param p
+ *   Address to monitor for changes.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ * @param lck
+ *   A spinlock that must be locked before entering the function, will be
+ *   unlocked while the CPU is sleeping, and will be locked again once the CPU
+ *   wakes up.
+ */
+__rte_experimental
+static inline void rte_power_monitor_sync(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz,
+		rte_spinlock_t *lck);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ */
+__rte_experimental
+static inline void rte_power_pause(const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
index ab4bd28092..0873b2aecb 100644
--- a/lib/librte_eal/ppc/include/meson.build
+++ b/lib/librte_eal/ppc/include/meson.build
@@ -10,6 +10,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch.h',
 	'rte_rwlock.h',
 	'rte_spinlock.h',
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..4ed03d521f
--- /dev/null
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_PPC_H_
+#define _RTE_POWER_INTRINSIC_PPC_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..f9b761d796
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_H_
+#define _RTE_POWER_INTRINSIC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses TPAUSE instruction and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_H_ */
-- 
2.17.1



* [dpdk-dev] [PATCH v9 03/10] eal: add intrinsics support check infrastructure
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                                 ` (2 preceding siblings ...)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 02/10] eal: add power management intrinsics Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API Liang Ma
                                 ` (6 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Currently, it is not possible to check support for intrinsics that
are platform-specific, cannot be abstracted in a generic way, or do not
have support on all architectures. The CPUID flags can be used to some
extent, but they are only defined for their platform, while intrinsics
will be available to all code as they are in generic headers.

This patch introduces infrastructure to check support for certain
platform-specific intrinsics, and adds checks for the IA power
management-related intrinsics based on UMWAIT/UMONITOR and TPAUSE.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Ray Kinsella <mdr@ashroe.eu>
---

Notes:
    v6:
    - Fix the comments
    v8:
    - Rename eal version.map
---
 lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
 .../include/generic/rte_power_intrinsics.h    | 12 +++++++++
 lib/librte_eal/ppc/rte_cpuflags.c             |  7 +++++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
 6 files changed, 64 insertions(+)

diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
index 7b257b7873..e3a53bcece 100644
--- a/lib/librte_eal/arm/rte_cpuflags.c
+++ b/lib/librte_eal/arm/rte_cpuflags.c
@@ -151,3 +151,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
index 872f0ebe3e..28a5aecde8 100644
--- a/lib/librte_eal/include/generic/rte_cpuflags.h
+++ b/lib/librte_eal/include/generic/rte_cpuflags.h
@@ -13,6 +13,32 @@
 #include "rte_common.h"
 #include <errno.h>
 
+#include <rte_compat.h>
+
+/**
+ * Structure used to describe platform-specific intrinsics that may or may not
+ * be supported at runtime.
+ */
+struct rte_cpu_intrinsics {
+	uint32_t power_monitor : 1;
+	/**< indicates support for rte_power_monitor function */
+	uint32_t power_pause : 1;
+	/**< indicates support for rte_power_pause function */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Check CPU support for various intrinsics at runtime.
+ *
+ * @param intrinsics
+ *     Pointer to a structure to be filled.
+ */
+__rte_experimental
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
+
 /**
  * Enumeration of all CPU features supported
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index fb897d9060..03a326f076 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -32,6 +32,10 @@
  * checked against the expected value, and if they match, the entering of
  * optimized power state may be aborted.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes.
  * @param expected_value
@@ -69,6 +73,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * This call will also lock a spinlock on entering sleep, and release it on
  * waking up the CPU.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes.
  * @param expected_value
@@ -101,6 +109,10 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
index 3bb7563ce9..61db5c216d 100644
--- a/lib/librte_eal/ppc/rte_cpuflags.c
+++ b/lib/librte_eal/ppc/rte_cpuflags.c
@@ -8,6 +8,7 @@
 #include <elf.h>
 #include <fcntl.h>
 #include <assert.h>
+#include <string.h>
 #include <unistd.h>
 
 /* Symbolic values for the entries in the auxiliary table */
@@ -108,3 +109,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index c23ff57ce6..269cdccfd3 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -402,6 +402,7 @@ EXPERIMENTAL {
 	rte_service_lcore_may_be_active;
 	rte_vect_get_max_simd_bitwidth;
 	rte_vect_set_max_simd_bitwidth;
+	rte_cpu_get_intrinsics_support;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 0325c4b93b..a96312ff7f 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -7,6 +7,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <stdint.h>
+#include <string.h>
 
 #include "rte_cpuid.h"
 
@@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+		intrinsics->power_monitor = 1;
+		intrinsics->power_pause = 1;
+	}
+}
-- 
2.17.1



* [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                                 ` (3 preceding siblings ...)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 03/10] eal: add intrinsics support check infrastructure Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-24 20:39                 ` Thomas Monjalon
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 05/10] power: add PMD power management API and callback Liang Ma
                                 ` (5 subsequent siblings)
  10 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Add a simple API for getting the address of the next RX descriptor from the
PMD, and add the corresponding release notes.
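
To give an idea of what a driver-side callback for this API might look
like, here is a self-contained sketch against a made-up descriptor ring.
Everything prefixed with demo_/DEMO_ is hypothetical and does not
correspond to any real NIC layout; only the callback's parameter list
follows the way rte_eth_get_wake_addr() invokes get_wake_addr in this
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "descriptor done" bit; real NICs define their own. */
#define DEMO_DESC_DONE 0x1u

struct demo_rx_desc {
	uint64_t buf_addr;
	volatile uint32_t status; /* written back by the (imaginary) NIC */
	uint32_t length;
};

struct demo_rx_queue {
	struct demo_rx_desc *ring;
	uint16_t nb_desc;
	uint16_t next; /* index of the next descriptor to be filled */
};

/*
 * Report where to wait: the status word of the next descriptor, plus
 * the value/mask/size used to detect that it was already written.
 */
static int
demo_get_wake_addr(void *rx_queue, volatile void **wake_addr,
		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
{
	struct demo_rx_queue *q = rx_queue;

	/* output pointers are a caller contract in this sketch */
	assert(wake_addr != NULL && expected != NULL);
	assert(mask != NULL && data_sz != NULL);

	if (q == NULL || q->ring == NULL || q->next >= q->nb_desc)
		return -EINVAL;

	*wake_addr = &q->ring[q->next].status;
	*expected = DEMO_DESC_DONE;
	*mask = DEMO_DESC_DONE;
	*data_sz = sizeof(q->ring[q->next].status);
	return 0;
}
```

The four outputs map directly onto the `p`, `expected_value`,
`value_mask` and `data_sz` arguments of rte_power_monitor().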

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v8:
    - Rename version map file name.

    v7:
    - Fixed queue ID validation
    - Fixed documentation

    v6:
    - Rebase on top of latest main
    - Ensure the API checks queue ID (Konstantin)
    - Removed accidental inclusion of unrelated release notes
    v5:
    - Bring function format in line with other functions in the file
    - Ensure the API is supported by the driver before calling it (Konstantin)
---
 doc/guides/rel_notes/release_20_11.rst |  5 +++++
 lib/librte_ethdev/rte_ethdev.c         | 23 +++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/version.map          |  1 +
 5 files changed, 85 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index d8ac359e51..2827a000db 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -139,6 +139,11 @@ New Features
   Hairpin Tx part flow rules can be inserted explicitly.
   New API is added to get the hairpin peer ports list.
 
+* **Added an EXPERIMENTAL ethdev API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()``
+  * New ``eth_dev_ops`` callback ``get_wake_addr``
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index b12bb3854d..4f3115fe8e 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5138,6 +5138,29 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask, data_sz));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index e341a08817..11559e7bc8 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4364,6 +4364,34 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake up address for the receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information will be
+ *   retrieved.
+ * @param wake_addr
+ *   The pointer to the address which will be monitored.
+ * @param expected
+ *   The pointer to value to be expected when descriptor is set.
+ * @param mask
+ *   The pointer to comparison bitmask for the expected value.
+ * @param data_sz
+ *   The pointer to data size for the expected value and comparison bitmask.
+ *
+ * @return
+ *   - (0) if successful.
+ *   - (-ENOTSUP) if the operation is not supported.
+ *   - (-EINVAL) if parameters are invalid.
+ *   - (-ENODEV) if the port ID is invalid.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c63b9f7eb7..d7548dfe74 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -752,6 +752,32 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an RX queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to where the address will be stored.
+ * @param expected
+ *   Pointer to the value expected when the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @param data_sz
+ *   Pointer to the data size of the expected value (1, 2, 4, or 8 bytes).
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -910,6 +936,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get next RX queue ring entry address. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index 8ddda2547f..392c273712 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -244,6 +244,7 @@ EXPERIMENTAL {
 	rte_flow_get_restore_info;
 	rte_flow_tunnel_action_decap_release;
 	rte_flow_tunnel_item_release;
+	rte_eth_get_wake_addr;
 };
 
 INTERNAL {
-- 
2.17.1



* [dpdk-dev] [PATCH v9 05/10] power: add PMD power management API and callback
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                                 ` (4 preceding siblings ...)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 06/10] net/ixgbe: implement power management API Liang Ma
                                 ` (4 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait until
either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).
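
As an illustration of the mapping constraint above, here is a minimal
sketch of a check an application could run on its own configuration
before enabling power management. The struct and function names are
hypothetical and not part of this patch; the tuples mirror the
(port,queue,lcore) triples used by the l3fwd-power --config option.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (port, queue, lcore) tuple; not a DPDK type. */
struct rxq_map {
	uint16_t port_id;
	uint16_t queue_id;
	unsigned int lcore_id;
};

/*
 * Return true if no lcore is assigned more than one RX queue,
 * i.e. the configuration is a core-to-single-queue mapping.
 */
static bool
mapping_is_one_queue_per_lcore(const struct rxq_map *map, size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if (map[i].lcore_id == map[j].lcore_id)
				return false;
	return true;
}
```

With this check, "(0,0,2),(0,1,3)" passes (two queues of port 0 on
lcores 2 and 3), while "(0,0,2),(0,1,2)" is rejected.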

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. Pause instruction:

   Instead of moving the core into a deeper C-state, this method uses
   the pause instruction to avoid busy polling.

3. Frequency scaling:

   Reuse the existing DPDK power library to scale core frequency up or
   down depending on traffic volume.
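
All three schemes share the same empty-poll trigger. A minimal
standalone sketch of that trigger follows, assuming the same threshold
as EMPTYPOLL_MAX in this patch; the struct and function names are
hypothetical, and the real callbacks keep this counter in
struct pmd_queue_cfg.

```c
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold value as in the patch */

/* Hypothetical per-queue state; only the counter matters here. */
struct queue_state {
	uint64_t empty_polls;
};

/*
 * Called once per RX burst with the number of packets received.
 * Returns true when the caller should take its power-saving action
 * (monitor/wait sleep, short delay, or frequency scale-down).
 */
static bool
should_save_power(struct queue_state *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		q->empty_polls = 0;
		return false;
	}
	q->empty_polls++;
	return q->empty_polls > EMPTYPOLL_MAX;
}
```

Because the comparison is strict, the first 512 consecutive empty polls
return false and the 513th returns true; any non-empty burst resets the
counter.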

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v8:
    - Rename version map file name

    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup

    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check
---
 doc/guides/rel_notes/release_20_11.rst |  11 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 320 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 +++++++
 lib/librte_power/version.map           |   4 +
 5 files changed, 430 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 2827a000db..5f32a5da1d 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -350,6 +350,17 @@ New Features
   * Replaced ``--scalar`` command-line option with ``--alg=<value>``, to allow
     the user to select the desired classify method.
 
+* **Add PMD power management mechanism**
+
+  Three new Ethernet PMD power management mechanisms are added through the
+  existing RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..0dcaddc3bd
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,320 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static inline void
+umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static inline void
+umwait_sleep(struct pmd_queue_cfg *q_conf, uint16_t port_id, uint16_t qidx)
+{
+	volatile void *target_addr;
+	uint64_t expected, mask;
+	uint8_t data_sz;
+	int ret;
+
+	/*
+	 * get wake up address for this RX queue, as well as expected value,
+	 * comparison mask, and data size.
+	 */
+	ret = rte_eth_get_wake_addr(port_id, qidx, &target_addr,
+			&expected, &mask, &data_sz);
+
+	/* this should always succeed as all checks have been done already */
+	if (unlikely(ret != 0))
+		return;
+
+	/*
+	 * take out a spinlock to prevent control plane from concurrently
+	 * modifying the wakeup data.
+	 */
+	rte_spinlock_lock(&q_conf->umwait_lock);
+
+	/* have we been disabled by control plane? */
+	if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		/* we're good to go */
+
+		/*
+		 * store the wakeup address so that control plane can trigger a
+		 * write to this address and wake us up.
+		 */
+		q_conf->wait_addr = target_addr;
+		/* -1ULL is maximum value for TSC */
+		rte_power_monitor_sync(target_addr, expected, mask, -1ULL,
+				data_sz, &q_conf->umwait_lock);
+		/* erase the address */
+		q_conf->wait_addr = NULL;
+	}
+	rte_spinlock_unlock(&q_conf->umwait_lock);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			umwait_sleep(q_conf, port_id, qidx);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+		/*
+		 * we need to disable early as there might be callback currently
+		 * spinning on a lock.
+		 */
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1



* [dpdk-dev] [PATCH v9 06/10] net/ixgbe: implement power management API
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                                 ` (5 preceding siblings ...)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 05/10] power: add PMD power management API and callback Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 07/10] net/i40e: " Liang Ma
                                 ` (3 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.
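
For reference, here is a minimal sketch of how a consumer of
get_wake_addr might interpret the returned (address, expected, mask,
data_sz) tuple before deciding to sleep. The helper name is
hypothetical, and this is an assumption about the intended semantics
rather than the actual rte_power_monitor() code: if the masked value
already equals the expected value, the descriptor has been written and
the core should not go to sleep.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: read data_sz bytes at addr, apply the mask and
 * compare against the expected value. Returns true if the wake
 * condition is already met (so sleeping should be skipped).
 */
static bool
wake_condition_met(const volatile void *addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	uint64_t cur;

	switch (data_sz) {
	case 1:
		cur = *(const volatile uint8_t *)addr;
		break;
	case 2:
		cur = *(const volatile uint16_t *)addr;
		break;
	case 4:
		cur = *(const volatile uint32_t *)addr;
		break;
	case 8:
		cur = *(const volatile uint64_t *)addr;
		break;
	default:
		/* unknown size: be conservative and do not sleep */
		return true;
	}
	return (cur & mask) == expected;
}
```

For ixgbe, both expected and mask would be the DD bit and data_sz would
be 4, so the check reduces to "is DD already set in status_error".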

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 00101c2eec..fcc4026372 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d1d3baff90..096dff37ba 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1367,6 +1367,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	*data_sz = 4;
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..1ef0b05e66 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1



* [dpdk-dev] [PATCH v9 07/10] net/i40e: implement power management API
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                                 ` (6 preceding siblings ...)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 06/10] net/ixgbe: implement power management API Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 08/10] net/ice: " Liang Ma
                                 ` (2 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 4778aaf299..358a38232b 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr	              = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..78862fe3a2 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,32 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	*data_sz = 8;
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..5826cf1099 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1



* [dpdk-dev] [PATCH v9 08/10] net/ice: implement power management API
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                                 ` (7 preceding siblings ...)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 07/10] net/i40e: " Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 09/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 10/10] doc: update programmer's guide for power library Liang Ma
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index c65125ff32..54f185ad4d 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr	              = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index ee576c362a..fafd6ada62 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	*data_sz = 2;
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..7eeb8d467e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1



* [dpdk-dev] [PATCH v9 09/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                                 ` (8 preceding siblings ...)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 08/10] net/ice: " Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 10/10] doc: update programmer's guide for power library Liang Ma
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v8:
    - Add return status check for queue enable

    v6:
    - Fixed typos in documentation
---
 .../sample_app_ug/l3_forward_power_man.rst    | 13 ++++++
 examples/l3fwd-power/main.c                   | 46 ++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index d7e1dc5813..b10ebe662e 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -455,3 +457,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables the power saving scheme on a specific
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index a48d75f68f..aafa415f0b 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt  mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,17 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(lcore_id,
+						     portid, queueid,
+						     RTE_POWER_MGMT_TYPE_SCALE);
+
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+					"rte_power_pmd_mgmt enable: err=%d, "
+					"port=%d\n", ret, portid);
+
+			}
 		}
 	}
 
@@ -2790,6 +2817,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2816,6 +2846,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1



* [dpdk-dev] [PATCH v9 10/10] doc: update programmer's guide for power library
  2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
                                 ` (9 preceding siblings ...)
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 09/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
@ 2020-10-23 23:06               ` Liang Ma
  2020-10-24 20:49                 ` Thomas Monjalon
  10 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-23 23:06 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang, beilei.xing,
	jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas,
	timothy.mcdaniel, gage.eads, drc, Liang Ma

Update programmer's guide to document PMD power management usage.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..38c64d31e4 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,45 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+  * UMWAIT/UMONITOR
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX descriptor
+   address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will use the `rte_pause` function to avoid busy
+   polling.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing power library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core
+
 References
 ----------
 
@@ -200,3 +239,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API Liang Ma
@ 2020-10-24 20:39                 ` Thomas Monjalon
  2020-10-27 11:15                   ` Liang, Ma
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-24 20:39 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang,
	beilei.xing, jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, drc, Andrew Rybchenko, ferruh.yigit,
	jerinj, hemant.agrawal, viacheslavo, matan, ajit.khaparde,
	rahul.lakkireddy, johndale, xavier.huwei, shahafs, sthemmin,
	g.singh, rmody, maxime.coquelin, david.marchand

24/10/2020 01:06, Liang Ma:
> Add a simple API to allow getting address of next RX descriptor from the
> PMD, as well as release notes information.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

You keep forgetting to Cc ethdev maintainers
(it is automatic when using --cc-cmd devtools/get-maintainer.sh).
As a result we still don't have any feedback from Ferruh and Andrew.
And I believe such API requires having feedback from maintainers
of other NIC drivers. I tried to explain this concern already
in email and community meetings, but I see no progress.

Below are my comments.

> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -139,6 +139,11 @@ New Features
>    Hairpin Tx part flow rules can be inserted explicitly.
>    New API is added to get the hairpin peer ports list.
>  
> +* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**

The guidelines at the top of the file say to use the past tense.
No need to mention it is experimental, as any new API is.

> +
> +  * ``rte_eth_get_wake_addr()``
> +  * add new eth_dev_ops ``get_wake_addr``

No need to mention the dev_ops in the release notes.
Better to explain what it brings from a user perspective.

> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> +/**
> + * Retrieve the wake up address for the receive queue.

I can guess how this function should be used,
but a bit more explanation would not hurt here.

> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Rx queue on the Ethernet device for which information will be
> + *   retrieved.
> + * @param wake_addr
> + *   The pointer to the address which will be monitored.

This function does not make the address monitored, right?

> + * @param expected
> + *   The pointer to value to be expected when descriptor is set.

Not sure we should restrict it to a "descriptor".

Expecting a value or some bits looks too restrictive.
I understand it probably fits well for Intel NICs,
but in the general case, we can imagine that any change
in a byte array could be a wake up signal.

> + * @param mask
> + *   The pointer to comparison bitmask for the expected value.
> + * @param data_sz
> + *   The pointer to data size for the expected value and comparison bitmask.

It is not clear that the above 4 parameters are filled by the driver.

> + *
> + * @return
> + *   - 0: Success.
> + *   -ENOTSUP: Operation not supported.
> + *   -EINVAL: Invalid parameters.
> + *   -ENODEV: Invalid port ID.
> + */
> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
> +		uint8_t *data_sz);
[...]
> --- a/lib/librte_ethdev/version.map
> +++ b/lib/librte_ethdev/version.map
> @@ -244,6 +244,7 @@ EXPERIMENTAL {
>  	rte_flow_get_restore_info;
>  	rte_flow_tunnel_action_decap_release;
>  	rte_flow_tunnel_item_release;
> +	rte_eth_get_wake_addr;
>  };

Please sort in alphabetical order.




* Re: [dpdk-dev] [PATCH v9 10/10] doc: update programmer's guide for power library
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 10/10] doc: update programmer's guide for power library Liang Ma
@ 2020-10-24 20:49                 ` Thomas Monjalon
  2020-10-27 11:04                   ` Liang, Ma
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-24 20:49 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang,
	beilei.xing, jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, drc

24/10/2020 01:06, Liang Ma:
> Update programmer's guide to document PMD power management usage.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Hunt <david.hunt@intel.com>
> ---
>  doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)

Why don't you update this doc in the patch 5
adding the code in the power library?


> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -192,6 +192,45 @@ User Cases
>  ----------
>  The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
>  
> +PMD Power Management API
> +------------------------
> +

Blank line missing below:

> +Abstract
> +~~~~~~~~
> +Existing power management mechanisms require developers to change application
> +design or change code to make use of it. The PMD power management API provides a
> +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
> +power saving whenever empty poll count reaches a certain number.
> +

Here you start a list of schemes, without introducing it.

> +  * UMWAIT/UMONITOR
> +
> +   This power saving scheme will put the CPU into optimized power state and use
> +   the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX descriptor
> +   address, and wake the CPU up whenever there's new traffic.
> +
> +  * Pause
> +
> +   This power saving scheme will use the `rte_pause` function to avoid busy
> +   polling.
> +
> +  * Frequency scaling
> +
> +   This power saving scheme will use existing power library functionality to
> +   scale the core frequency up/down depending on traffic volume.
> +
> +
> +.. note::
> +
> +   Currently, this power management API is limited to mandatory mapping of 1
> +   queue to 1 core (multiple queues are supported, but they must be polled from
> +   different cores).
> +
> +API Overview for PMD Power Management
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Underlining is long enough...
... but a blank line is missing after the title.

> +* **Queue Enable**: Enable specific power scheme for certain queue/port/core
> +
> +* **Queue Disable**: Disable power scheme for certain queue/port/core





* Re: [dpdk-dev] [PATCH v9 10/10] doc: update programmer's guide for power library
  2020-10-24 20:49                 ` Thomas Monjalon
@ 2020-10-27 11:04                   ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-10-27 11:04 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang,
	beilei.xing, jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, drc

Hi Thomas, 
   all is fixed in v10.
Regards
Liang
On 24 Oct 22:49, Thomas Monjalon wrote:
> 24/10/2020 01:06, Liang Ma:
> > Update programmer's guide to document PMD power management usage.
> > 
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > Acked-by: David Hunt <david.hunt@intel.com>
> > ---
> >  doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> 
> Why don't you update this doc in the patch 5
> adding the code in the power library?
> 
> 
> > --- a/doc/guides/prog_guide/power_man.rst
> > +++ b/doc/guides/prog_guide/power_man.rst
> > @@ -192,6 +192,45 @@ User Cases
> >  ----------
> >  The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
> >  
> > +PMD Power Management API
> > +------------------------
> > +
> 
> Blank line missing below:
> 
> > +Abstract
> > +~~~~~~~~
> > +Existing power management mechanisms require developers to change application
> > +design or change code to make use of it. The PMD power management API provides a
> > +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
> > +power saving whenever empty poll count reaches a certain number.
> > +
> 
> Here you start a list of schemes, without introducing it.
> 
> > +  * UMWAIT/UMONITOR
> > +
> > +   This power saving scheme will put the CPU into optimized power state and use
> > +   the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX descriptor
> > +   address, and wake the CPU up whenever there's new traffic.
> > +
> > +  * Pause
> > +
> > +   This power saving scheme will use the `rte_pause` function to avoid busy
> > +   polling.
> > +
> > +  * Frequency scaling
> > +
> > +   This power saving scheme will use existing power library functionality to
> > +   scale the core frequency up/down depending on traffic volume.
> > +
> > +
> > +.. note::
> > +
> > +   Currently, this power management API is limited to mandatory mapping of 1
> > +   queue to 1 core (multiple queues are supported, but they must be polled from
> > +   different cores).
> > +
> > +API Overview for PMD Power Management
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Underlining is long enough...
> ... but a blank line is missing after the title.
> 
> > +* **Queue Enable**: Enable specific power scheme for certain queue/port/core
> > +
> > +* **Queue Disable**: Disable power scheme for certain queue/port/core
> 
> 
> 


* Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-24 20:39                 ` Thomas Monjalon
@ 2020-10-27 11:15                   ` Liang, Ma
  2020-10-27 15:52                     ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-10-27 11:15 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang,
	beilei.xing, jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, drc, Andrew Rybchenko, ferruh.yigit,
	jerinj, hemant.agrawal, viacheslavo, matan, ajit.khaparde,
	rahul.lakkireddy, johndale, xavier.huwei, shahafs, sthemmin,
	g.singh, rmody, maxime.coquelin, david.marchand

<snip>
Thanks for the information.
Sorry about that.
All related maintainers (including those of other NIC PMDs) will be Cc'ed in v10.

> You keep forgetting to Cc ethdev maintainers
> (it is automatic when using --cc-cmd devtools/get-maintainer.sh).
> As a result we still don't have any feedback from Ferruh and Andrew.
> And I believe such API requires having feedback from maintainers
> of other NIC drivers. I tried to explain this concern already
> in email and community meetings, but I see no progress.
> 
> Below are my comments.
> 
> > --- a/doc/guides/rel_notes/release_20_11.rst
> > +++ b/doc/guides/rel_notes/release_20_11.rst
> > @@ -139,6 +139,11 @@ New Features
> >    Hairpin Tx part flow rules can be inserted explicitly.
> >    New API is added to get the hairpin peer ports list.
> >  
> > +* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**
> 
> The guidelines at the top of the file say to use the past tense.
> No need to mention it is experimental, as any new API is.
agree
> 
> > +
> > +  * ``rte_eth_get_wake_addr()``
> > +  * add new eth_dev_ops ``get_wake_addr``
> 
> No need to mention the dev_ops in the release notes.
> Better to explain what it brings from a user perspective.
agree
> 
> > --- a/lib/librte_ethdev/rte_ethdev.h
> > +++ b/lib/librte_ethdev/rte_ethdev.h
> > +/**
> > + * Retrieve the wake up address for the receive queue.
> 
> I can guess how this function should be used,
> but a bit more explanation would not hurt here.
agree
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The Rx queue on the Ethernet device for which information will be
> > + *   retrieved.
> > + * @param wake_addr
> > + *   The pointer to the address which will be monitored.
> 
> This function does not make the address monitored, right?
This function only gets the target wake-up address; it does not monitor that address.
> 
> > + * @param expected
> > + *   The pointer to value to be expected when descriptor is set.
> 
> Not sure we should restrict it to a "descriptor".
Actually, it is not limited to a descriptor; any write-back content should work.
> 
> Expecting a value or some bits looks too restrictive.
> I understand it probably fits well for Intel NICs,
> but in the general case, we can imagine that any change
> in a byte array could be a wake up signal.
This parameter does not limit how the user uses it.
In fact, the current design can support a change in any bits within 64 bits of content.
> 
> > + * @param mask
> > + *   The pointer to comparison bitmask for the expected value.
> > + * @param data_sz
> > + *   The pointer to data size for the expected value and comparison bitmask.
> 
> It is not clear that the above 4 parameters are filled by the driver.
> 
> > + *
> > + * @return
> > + *   - 0: Success.
> > + *   -ENOTSUP: Operation not supported.
> > + *   -EINVAL: Invalid parameters.
> > + *   -ENODEV: Invalid port ID.
> > + */
> > +__rte_experimental
> > +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> > +		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
> > +		uint8_t *data_sz);
> [...]
> > --- a/lib/librte_ethdev/version.map
> > +++ b/lib/librte_ethdev/version.map
> > @@ -244,6 +244,7 @@ EXPERIMENTAL {
> >  	rte_flow_get_restore_info;
> >  	rte_flow_tunnel_action_decap_release;
> >  	rte_flow_tunnel_item_release;
> > +	rte_eth_get_wake_addr;
> >  };
> 
> Please sort in alphabetical order.
Agreed, will do.
> 
> 


* [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-27 16:02                   ` Thomas Monjalon
                                     ` (2 more replies)
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 1/9] eal: add new x86 cpuid support for WAITPKG Liang Ma
                                   ` (8 subsequent siblings)
  9 siblings, 3 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang

This patchset proposes a simple API for Ethernet drivers
to cause the CPU to enter a power-optimized state while
waiting for packets to arrive, along with a set of
generic intrinsics that facilitate that. This is achieved
through cooperation with the NIC driver, which lets us
learn the address of the wake-up event and wait for writes
to it.

On IA, this is achieved through using UMONITOR/UMWAIT
instructions. They are used in their raw opcode form
because there is no widespread compiler support for
them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen
to implement similar instructions.

To achieve power savings, there is a very simple mechanism
used: we're counting empty polls, and if a certain threshold
is reached, we get the address of the next RX ring descriptor
from the NIC driver, arm the monitoring hardware, and
enter a power-optimized state. We will then wake up when
either a timeout happens, or a write happens (or generally
whenever CPU feels like waking up - this is platform-
specific), and proceed as normal. The empty poll counter is
reset whenever we actually get packets, so we only go to
sleep when we know nothing is going on. The mechanism is
generic and can be used with any write-back descriptor.
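
As an illustration only, the empty-poll logic described above could be
sketched roughly as follows. This is not the code from the patchset:
EMPTY_POLL_THRESHOLD, struct queue_state, power_optimized_wait() and
after_rx_burst() are hypothetical names, the threshold value is made up,
and the wait itself is stubbed out so the sketch builds without a NIC or
UMWAIT support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold; the value used by the patchset is not shown here. */
#define EMPTY_POLL_THRESHOLD 4

struct queue_state {
	uint32_t empty_polls; /* consecutive RX bursts that returned 0 packets */
	uint32_t sleeps;      /* number of times the low-power wait was entered */
};

/*
 * Stand-in for "get the wake address from the driver, arm the monitor,
 * enter the power-optimized state". Here it only counts invocations.
 */
static void
power_optimized_wait(struct queue_state *qs)
{
	qs->sleeps++;
}

/* Call after every RX burst with the number of packets it returned. */
static void
after_rx_burst(struct queue_state *qs, uint16_t nb_rx)
{
	assert(qs != NULL);

	if (nb_rx != 0) {
		/* Traffic seen: reset the counter so we do not sleep. */
		qs->empty_polls = 0;
		return;
	}

	/*
	 * The counter is not reset after a wait, so once the threshold is
	 * reached every further empty poll waits again until traffic resumes.
	 */
	if (++qs->empty_polls >= EMPTY_POLL_THRESHOLD)
		power_optimized_wait(qs);
}
```

A caller would pass the return value of each RX burst to
after_rx_burst(); whether to keep waiting on every empty poll past the
threshold, as this sketch does, is a design choice of the sketch.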

Why are we putting it into ethdev as opposed to leaving
this up to the application? Our customers specifically
requested a way to do it with minimal changes to the
application code. The current approach allows them to just
flip a switch and automatically get power savings.

- Only 1:1 core to queue mapping is supported,
  meaning that each lcore can handle RX on at most a
  single queue
- Three policy types are supported: UMWAIT, PAUSE, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

Liang Ma (9):
  eal: add new x86 cpuid support for WAITPKG
  eal: add power management intrinsics
  eal: add intrinsics support check infrastructure
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  48 +++
 doc/guides/rel_notes/release_20_11.rst        |  15 +
 .../sample_app_ug/l3_forward_power_man.rst    |  13 +
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  26 ++
 drivers/net/i40e/i40e_rxtx.h                  |   2 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
 examples/l3fwd-power/main.c                   |  46 ++-
 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  60 ++++
 lib/librte_eal/arm/rte_cpuflags.c             |   6 +
 lib/librte_eal/include/generic/rte_cpuflags.h |  26 ++
 .../include/generic/rte_power_intrinsics.h    | 123 +++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  60 ++++
 lib/librte_eal/ppc/rte_cpuflags.c             |   7 +
 lib/librte_eal/version.map                    |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 135 ++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |  14 +
 lib/librte_ethdev/rte_ethdev.c                |  23 ++
 lib/librte_ethdev/rte_ethdev.h                |  33 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  28 ++
 lib/librte_ethdev/version.map                 |   1 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 320 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  92 +++++
 lib/librte_power/version.map                  |   4 +
 35 files changed, 1148 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.17.1



* [dpdk-dev] [PATCH v10 1/9] eal: add new x86 cpuid support for WAITPKG
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 0/9] " Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 2/9] eal: add power management intrinsics Liang Ma
                                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov

Add a new CPUID flag indicating processor support for the UMONITOR/UMWAIT
and TPAUSE instructions.
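
For readers unfamiliar with the flag, here is a hedged sketch of the
underlying check. WAITPKG is reported in CPUID leaf 7, sub-leaf 0, ECX
bit 5, which is the position the feature table entry in this patch uses.
The helper names below are hypothetical, and __get_cpuid_count() from
<cpuid.h> assumes a GCC- or Clang-compatible compiler; on non-x86 builds
the sketch simply reports the feature as absent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=07H,ECX=0):ECX[5] reports WAITPKG. */
#define WAITPKG_ECX_BIT 5
static_assert(WAITPKG_ECX_BIT < 32, "bit index must fit in ECX");

/* Decode the WAITPKG bit from an ECX value returned by CPUID leaf 7. */
static bool
ecx_has_waitpkg(uint32_t ecx)
{
	return ((ecx >> WAITPKG_ECX_BIT) & 1u) != 0;
}

/* Query the running CPU; returns false when leaf 7 is unavailable. */
static bool
cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return ecx_has_waitpkg(ecx);
#else
	return false;
#endif
}
```

Within DPDK itself, an application would normally query the new flag
through rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG) rather than issue
CPUID directly.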

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_eal/x86/include/rte_cpuflags.h | 1 +
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..848ba9cbfb 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1



* [dpdk-dev] [PATCH v10 2/9] eal: add power management intrinsics
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 0/9] " Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 1/9] eal: add new x86 cpuid support for WAITPKG Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-29 17:39                   ` Thomas Monjalon
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure Liang Ma
                                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov,
	Jerin Jacob, Jan Viktorin, David Christensen

Add three new power management intrinsics (rte_power_monitor,
rte_power_monitor_sync and rte_power_pause), and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT and TPAUSE instructions. The
instructions are implemented as raw byte opcodes because there is not yet
widespread compiler support for them.

The power management intrinsics provide architecture-specific functions
to either wait until a specified TSC timestamp is reached, or wait until
either a TSC timestamp is reached or a memory location is written to. The
monitor functions also provide an optional comparison, to avoid sleeping
when the expected write has already happened and no more writes are
expected.

For more details, please refer to Intel(R) 64 and IA-32 Architectures
Software Developer's Manual, Volume 2.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
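
The pre-sleep check done by the x86 rte_power_monitor() between UMONITOR
and UMWAIT can be modelled in portable C. This is only a sketch for
illustration: read_sized(), should_skip_sleep() and split_tsc() are
hypothetical names that are not part of this patch, and the raw opcodes
are left out so that the logic can be exercised on any machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring __get_umwait_val(): read 1, 2, 4 or 8 bytes. */
static inline uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		assert(0 && "unsupported data size");
		return 0;
	}
}

/*
 * Hypothetical model of the check done between UMONITOR and UMWAIT:
 * returns 1 if the sleep would be skipped, 0 if UMWAIT would be executed.
 */
static inline int
should_skip_sleep(const volatile void *p, uint64_t expected_value,
		uint64_t value_mask, uint8_t data_sz)
{
	if (value_mask == 0)
		return 0; /* comparison disabled, always sleep */
	return (read_sized(p, data_sz) & value_mask) == expected_value;
}

/* Split a 64-bit TSC deadline into the EDX:EAX halves used by the asm. */
static inline void
split_tsc(uint64_t tsc_timestamp, uint32_t *tsc_l, uint32_t *tsc_h)
{
	*tsc_l = (uint32_t)tsc_timestamp;
	*tsc_h = (uint32_t)(tsc_timestamp >> 32);
}
```

A zero mask disables the comparison, so the caller always goes to sleep;
with a non-zero mask the sleep is skipped only when the masked value
already equals the expected value.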
 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  60 ++++++++
 .../include/generic/rte_power_intrinsics.h    | 111 ++++++++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  60 ++++++++
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 135 ++++++++++++++++++
 8 files changed, 370 insertions(+)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
index 73b750a18f..c6a9f70d73 100644
--- a/lib/librte_eal/arm/include/meson.build
+++ b/lib/librte_eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
 	'rte_pause_32.h',
 	'rte_pause_64.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch_32.h',
 	'rte_prefetch_64.h',
 	'rte_prefetch.h',
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a4a1bc1159
--- /dev/null
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_ARM_H_
+#define _RTE_POWER_INTRINSIC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..fb897d9060
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ */
+__rte_experimental
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This call will also unlock a spinlock on entering sleep, and lock it again
+ * on waking up the CPU.
+ *
+ * @param p
+ *   Address to monitor for changes.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ * @param data_sz
+ *   Data size (in bytes) that will be used to compare expected value with the
+ *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
+ *   to undefined result.
+ * @param lck
+ *   A spinlock that must be locked before entering the function, will be
+ *   unlocked while the CPU is sleeping, and will be locked again once the CPU
+ *   wakes up.
+ */
+__rte_experimental
+static inline void rte_power_monitor_sync(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp, const uint8_t data_sz,
+		rte_spinlock_t *lck);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ */
+__rte_experimental
+static inline void rte_power_pause(const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
index ab4bd28092..0873b2aecb 100644
--- a/lib/librte_eal/ppc/include/meson.build
+++ b/lib/librte_eal/ppc/include/meson.build
@@ -10,6 +10,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch.h',
 	'rte_rwlock.h',
 	'rte_spinlock.h',
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..4ed03d521f
--- /dev/null
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_PPC_H_
+#define _RTE_POWER_INTRINSIC_PPC_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..f9b761d796
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_H_
+#define _RTE_POWER_INTRINSIC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses the TPAUSE instruction and will enter C0.2 state. For
+ * more information about usage of this instruction, please refer to Intel(R)
+ * 64 and IA-32 Architectures Software Developer's Manual.
+ */
+static inline void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
                                   ` (2 preceding siblings ...)
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 2/9] eal: add power management intrinsics Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-29 21:27                   ` David Marchand
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 4/9] ethdev: add simple power management API Liang Ma
                                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov,
	Jerin Jacob, Jan Viktorin, David Christensen, Ray Kinsella

Currently, it is not possible to check support for intrinsics that
are platform-specific, cannot be abstracted in a generic way, or do not
have support on all architectures. The CPUID flags can be used to some
extent, but they are only defined for their platform, while intrinsics
will be available to all code as they are in generic headers.

This patch introduces infrastructure to check support for certain
platform-specific intrinsics, and uses it to report whether the IA power
management intrinsics based on UMONITOR/UMWAIT and TPAUSE are available.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Ray Kinsella <mdr@ashroe.eu>
--

Notes:
   v8:
    - Rename eal version.map

   v6:
    - Fix the comments
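
The check added for x86 boils down to testing CPUID leaf 7, subleaf 0,
ECX bit 5 (WAITPKG, as defined in patch 1/9). The sketch below shows the
same idea outside of DPDK, assuming a GCC or Clang toolchain that ships
<cpuid.h>. struct cpu_intrinsics, ecx_has_waitpkg() and
get_intrinsics_support() are hypothetical stand-ins, not the code in this
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Local stand-in for struct rte_cpu_intrinsics from this patch. */
struct cpu_intrinsics {
	uint32_t power_monitor : 1;
	uint32_t power_pause : 1;
};

/* WAITPKG is reported in CPUID leaf 7, subleaf 0, ECX bit 5. */
#define WAITPKG_ECX_BIT 5

/* Hypothetical helper: decode the WAITPKG bit from a raw ECX value. */
static inline int
ecx_has_waitpkg(uint32_t ecx)
{
	return (ecx >> WAITPKG_ECX_BIT) & 1;
}

/* Hypothetical equivalent of rte_cpu_get_intrinsics_support(). */
static inline void
get_intrinsics_support(struct cpu_intrinsics *intrinsics)
{
	assert(intrinsics != NULL);
	memset(intrinsics, 0, sizeof(*intrinsics));
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 if the leaf is not available */
	if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) &&
			ecx_has_waitpkg(ecx)) {
		intrinsics->power_monitor = 1;
		intrinsics->power_pause = 1;
	}
#endif
}
```

On architectures other than x86 the structure simply stays zeroed, which
matches what the ARM and PPC versions in this patch do.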
---
 lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
 .../include/generic/rte_power_intrinsics.h    | 12 +++++++++
 lib/librte_eal/ppc/rte_cpuflags.c             |  7 +++++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
 6 files changed, 64 insertions(+)

diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
index 7b257b7873..e3a53bcece 100644
--- a/lib/librte_eal/arm/rte_cpuflags.c
+++ b/lib/librte_eal/arm/rte_cpuflags.c
@@ -151,3 +151,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
index 872f0ebe3e..28a5aecde8 100644
--- a/lib/librte_eal/include/generic/rte_cpuflags.h
+++ b/lib/librte_eal/include/generic/rte_cpuflags.h
@@ -13,6 +13,32 @@
 #include "rte_common.h"
 #include <errno.h>
 
+#include <rte_compat.h>
+
+/**
+ * Structure used to describe platform-specific intrinsics that may or may not
+ * be supported at runtime.
+ */
+struct rte_cpu_intrinsics {
+	uint32_t power_monitor : 1;
+	/**< indicates support for rte_power_monitor function */
+	uint32_t power_pause : 1;
+	/**< indicates support for rte_power_pause function */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Check CPU support for various intrinsics at runtime.
+ *
+ * @param intrinsics
+ *     Pointer to a structure to be filled.
+ */
+__rte_experimental
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
+
 /**
  * Enumeration of all CPU features supported
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index fb897d9060..03a326f076 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -32,6 +32,10 @@
  * checked against the expected value, and if they match, the entering of
  * optimized power state may be aborted.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes.
  * @param expected_value
@@ -69,6 +73,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * This call will also unlock a spinlock on entering sleep, and lock it again
  * on waking up the CPU.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes.
  * @param expected_value
@@ -101,6 +109,10 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
  *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
index 3bb7563ce9..61db5c216d 100644
--- a/lib/librte_eal/ppc/rte_cpuflags.c
+++ b/lib/librte_eal/ppc/rte_cpuflags.c
@@ -8,6 +8,7 @@
 #include <elf.h>
 #include <fcntl.h>
 #include <assert.h>
+#include <string.h>
 #include <unistd.h>
 
 /* Symbolic values for the entries in the auxiliary table */
@@ -108,3 +109,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index c23ff57ce6..269cdccfd3 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -402,6 +402,7 @@ EXPERIMENTAL {
 	rte_service_lcore_may_be_active;
 	rte_vect_get_max_simd_bitwidth;
 	rte_vect_set_max_simd_bitwidth;
+	rte_cpu_get_intrinsics_support;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 0325c4b93b..a96312ff7f 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -7,6 +7,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <stdint.h>
+#include <string.h>
 
 #include "rte_cpuid.h"
 
@@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+		intrinsics->power_monitor = 1;
+		intrinsics->power_pause = 1;
+	}
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v10 4/9] ethdev: add simple power management API
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
                                   ` (3 preceding siblings ...)
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 5/9] power: add PMD power management API and callback Liang Ma
                                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov,
	Ferruh Yigit, Andrew Rybchenko, Ray Kinsella

Add a simple API to get the address of the next RX descriptor from the
PMD, and add the corresponding release notes entry.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v10:
    - Address minor issue on comments and release notes

    v8:
    - Rename version map file name

    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup

    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check
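
The validation order in rte_eth_get_wake_addr() (driver support first,
then queue ID, then the PMD callback) can be illustrated with a reduced
mock. Everything below is a hypothetical stand-in for the ethdev
structures: struct mock_dev, struct mock_rxq, mock_get_wake_addr() and
dev_get_wake_addr() do not exist in DPDK, and the port ID lookup is left
out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as eth_get_wake_addr_t in this patch. */
typedef int (*get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);

/* Hypothetical, heavily reduced stand-in for struct rte_eth_dev. */
struct mock_dev {
	get_wake_addr_t get_wake_addr; /* NULL if the PMD lacks support */
	void **rx_queues;
	uint16_t nb_rx_queues;
};

/* Hypothetical RX queue: one 32-bit status word, bit 0 means "done". */
struct mock_rxq {
	volatile uint32_t status;
};

/* Hypothetical PMD callback: report where and what to watch for. */
static int
mock_get_wake_addr(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
{
	struct mock_rxq *q = rxq;

	*tail_desc_addr = &q->status;
	*expected = 0x1;
	*mask = 0x1;
	*data_sz = sizeof(q->status);
	return 0;
}

/* Mirrors the checks in rte_eth_get_wake_addr(), minus the port lookup. */
static int
dev_get_wake_addr(struct mock_dev *dev, uint16_t queue_id,
		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
		uint8_t *data_sz)
{
	assert(dev != NULL);
	if (dev->get_wake_addr == NULL)
		return -ENOTSUP;
	if (queue_id >= dev->nb_rx_queues)
		return -EINVAL;
	return dev->get_wake_addr(dev->rx_queues[queue_id], wake_addr,
			expected, mask, data_sz);
}
```

The values returned through the four output pointers are meant to be
passed straight to rte_power_monitor() as p, expected_value, value_mask
and data_sz.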
---
 doc/guides/rel_notes/release_20_11.rst |  4 ++++
 lib/librte_ethdev/rte_ethdev.c         | 23 ++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 33 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 28 ++++++++++++++++++++++
 lib/librte_ethdev/version.map          |  1 +
 5 files changed, 89 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index dca8d41eb6..2bdc8f9948 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -139,6 +139,10 @@ New Features
   Hairpin Tx part flow rules can be inserted explicitly.
   New API is added to get the hairpin peer ports list.
 
+* **Added a new ethdev API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()`` is added to get the wake up address from the device.
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index b12bb3854d..4f3115fe8e 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5138,6 +5138,29 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask, data_sz));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index e341a08817..e522d5d433 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4364,6 +4364,39 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake up address for a device queue or other data structure.
+ * Depending on the HW design, any write back mechanism can be used as a
+ * signal to wake up the processor.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information will be
+ *   retrieved.
+ * @param wake_addr
+ *   Pointer that the driver fills with the address to be monitored.
+ * @param expected
+ *   Pointer to the value expected at the monitored address once it is
+ *   written; filled by the driver.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value; filled by the
+ *   driver.
+ * @param data_sz
+ *   Pointer to the data size of the expected value and comparison bitmask;
+ *   filled by the driver.
+ *
+ * @return
+ *   - 0: Success.
+ *   - -ENOTSUP: Operation not supported.
+ *   - -EINVAL: Invalid parameters.
+ *   - -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c63b9f7eb7..d7548dfe74 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -752,6 +752,32 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an RX queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to where the address will be stored.
+ * @param expected
+ *   Pointer to the value to be expected when the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @param data_sz
+ *   Pointer to the data size of the expected value (1, 2, 4, or 8 bytes).
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -910,6 +936,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get next RX queue ring entry address. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index 8ddda2547f..f9ce4e3c8d 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -235,6 +235,7 @@ EXPERIMENTAL {
 	rte_eth_fec_get_capability;
 	rte_eth_fec_get;
 	rte_eth_fec_set;
+	rte_eth_get_wake_addr;
 	rte_flow_shared_action_create;
 	rte_flow_shared_action_destroy;
 	rte_flow_shared_action_query;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v10 5/9] power: add PMD power management API and callback
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
                                   ` (4 preceding siblings ...)
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 4/9] ethdev: add simple power management API Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 6/9] net/ixgbe: implement power management API Liang Ma
                                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov,
	Ray Kinsella

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. Pause instruction:

   Instead of moving the core into a deeper C-state, this method uses
   the pause instruction to avoid busy polling.

3. Frequency scaling:

   Reuse the existing DPDK power library to scale the core frequency
   up/down depending on traffic volume.
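
All three schemes share the same trigger: count empty polls, reset the
counter when packets arrive, and act once the count exceeds a threshold.
A stand-alone sketch of that decision is below. The struct and function
names are hypothetical; only the EMPTYPOLL_MAX value is taken from the
patch, and the power-saving action itself (UMWAIT, pause or scaling
down) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold value as used in the patch */

/* Hypothetical per-queue state; only the counter is modelled here. */
struct poll_state {
	uint64_t empty_polls;
};

/*
 * Update the counter after one RX burst and report whether the caller
 * should take its power-saving action for this poll.
 */
static bool
poll_should_save_power(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls > EMPTYPOLL_MAX;
}
```

Note that the comparison is strict, so the first power-saving action
happens on the poll after EMPTYPOLL_MAX consecutive empty polls.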

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v10:
    - Updated power library document

    v8:
    - Rename version map file name

    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup

    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support
        get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the
      intrinsic check
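
For reviewers unfamiliar with the wakeup mechanism added in v6: the idea
is to touch the monitored word without changing its value, by atomically
exchanging the value with itself. Below is a portable sketch of that
idea using C11 atomics; the patch itself uses the GCC __atomic builtins.
The function name is hypothetical, and whether the exchange actually
wakes a sleeping core is a property of the hardware that this sketch
assumes rather than demonstrates.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Read the current value and atomically compare-exchange it with
 * itself. The stored value never changes; the assumption is that the
 * locked exchange is seen as a write to the monitored cache line.
 */
static inline void
touch_monitored_word(_Atomic uint64_t *addr)
{
	uint64_t val = atomic_load_explicit(addr, memory_order_relaxed);

	atomic_compare_exchange_strong_explicit(addr, &val, val,
			memory_order_relaxed, memory_order_relaxed);
}
```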
---
 doc/guides/prog_guide/power_man.rst    |  48 ++++
 doc/guides/rel_notes/release_20_11.rst |  11 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 320 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 +++++++
 lib/librte_power/version.map           |   4 +
 6 files changed, 478 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..1b1064c749 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,51 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+
+Existing power management mechanisms require developers to change the
+application design or code to make use of them. The PMD power management API
+provides a convenient alternative by utilizing Ethernet PMD RX callbacks, and
+triggering power saving whenever the empty poll count reaches a certain number.
+
+There are multiple power saving schemes available for the developer to choose
+from. Although each queue can be configured with a different scheme, it is
+strongly recommended to use the same scheme for all queues of the same port.
+
+  * UMWAIT/UMONITOR
+
+   This power saving scheme will put the CPU into an optimized power state and
+   use the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will use the ``rte_pause`` function to avoid busy
+   polling.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing power library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Queue Enable**: Enable a specific power scheme for a given queue/port/core
+
+* **Queue Disable**: Disable the power scheme for a given queue/port/core
+
 References
 ----------
 
@@ -200,3 +245,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 2bdc8f9948..5fd8c16025 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -349,6 +349,17 @@ New Features
   * Replaced ``--scalar`` command-line option with ``--alg=<value>``, to allow
     the user to select the desired classify method.
 
+* **Add PMD power management mechanism**
+
+  Three new Ethernet PMD power management mechanisms are added through the
+  existing RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..0dcaddc3bd
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,320 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static inline void
+umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static inline void
+umwait_sleep(struct pmd_queue_cfg *q_conf, uint16_t port_id, uint16_t qidx)
+{
+	volatile void *target_addr;
+	uint64_t expected, mask;
+	uint8_t data_sz;
+	int ret;
+
+	/*
+	 * get wake up address for this RX queue, as well as expected value,
+	 * comparison mask, and data size.
+	 */
+	ret = rte_eth_get_wake_addr(port_id, qidx, &target_addr,
+			&expected, &mask, &data_sz);
+
+	/* this should always succeed as all checks have been done already */
+	if (unlikely(ret != 0))
+		return;
+
+	/*
+	 * take out a spinlock to prevent control plane from concurrently
+	 * modifying the wakeup data.
+	 */
+	rte_spinlock_lock(&q_conf->umwait_lock);
+
+	/* have we been disabled by control plane? */
+	if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		/* we're good to go */
+
+		/*
+		 * store the wakeup address so that control plane can trigger a
+		 * write to this address and wake us up.
+		 */
+		q_conf->wait_addr = target_addr;
+		/* -1ULL is maximum value for TSC */
+		rte_power_monitor_sync(target_addr, expected, mask, -1ULL,
+				data_sz, &q_conf->umwait_lock);
+		/* erase the address */
+		q_conf->wait_addr = NULL;
+	}
+	rte_spinlock_unlock(&q_conf->umwait_lock);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			umwait_sleep(q_conf, port_id, qidx);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+		/*
+		 * we need to disable early as there might be a callback
+		 * currently spinning on a lock.
+		 */
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the RX queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the RX queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1



* [dpdk-dev] [PATCH v10 6/9] net/ixgbe: implement power management API
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
                                   ` (5 preceding siblings ...)
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 5/9] power: add PMD power management API and callback Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 7/9] net/i40e: " Liang Ma
                                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov,
	Jeff Guo

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.
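
To show how the returned address, expected value, mask and data size fit
together, here is a self-contained sketch of the check a caller might
perform before going to sleep. It assumes that a masked value equal to
the expected one means the descriptor has already been written, so
sleeping should be skipped. The function names are hypothetical, byte
order handling is omitted, and this is not the intrinsic from the
earlier patch in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Read data_sz bytes (1, 2, 4 or 8) from addr into a 64-bit value. */
static uint64_t
read_monitored_value(const volatile void *addr, uint8_t data_sz)
{
	switch (data_sz) {
	case 1:
		return *(const volatile uint8_t *)addr;
	case 2:
		return *(const volatile uint16_t *)addr;
	case 4:
		return *(const volatile uint32_t *)addr;
	case 8:
		return *(const volatile uint64_t *)addr;
	default:
		return 0; /* unsupported size: nothing meaningful to read */
	}
}

/* True if the masked value already equals 'expected': do not sleep. */
static bool
wake_condition_met(const volatile void *addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	return (read_monitored_value(addr, data_sz) & mask) == expected;
}
```

For a 32-bit status word whose lowest bit plays the role of the DD bit,
the call would pass that bit as both expected value and mask, with a
data size of 4.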

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 00101c2eec..fcc4026372 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 5f19972031..305822836d 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1367,6 +1367,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	*data_sz = 4;
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..1ef0b05e66 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1



* [dpdk-dev] [PATCH v10 7/9] net/i40e: implement power management API
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
                                   ` (6 preceding siblings ...)
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 6/9] net/ixgbe: implement power management API Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 8/9] net/ice: " Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 9/9] examples/l3fwd-power: enable PMD power mgmt Liang Ma
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov,
	Beilei Xing, Jeff Guo

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 4778aaf299..358a38232b 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr                = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..78862fe3a2 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,32 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	*data_sz = 8;
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..5826cf1099 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1



* [dpdk-dev] [PATCH v10 8/9] net/ice: implement power management API
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
                                   ` (7 preceding siblings ...)
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 7/9] net/i40e: " Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 9/9] examples/l3fwd-power: enable PMD power mgmt Liang Ma
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov,
	Qiming Yang, Qi Zhang

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index c65125ff32..54f185ad4d 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr                = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index ee576c362a..fafd6ada62 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	*data_sz = 2;
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..7eeb8d467e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1



* [dpdk-dev] [PATCH v10 9/9] examples/l3fwd-power: enable PMD power mgmt
  2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
                                   ` (8 preceding siblings ...)
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 8/9] net/ice: " Liang Ma
@ 2020-10-27 14:59                 ` Liang Ma
  9 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-27 14:59 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale,
	xavier.huwei, xuanziyang2, matan, yongwang, Anatoly Burakov

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v8:
    - Add return status check for queue enable

    v6:
    - Fixed typos in documentation
---
 .../sample_app_ug/l3_forward_power_man.rst    | 13 ++++++
 examples/l3fwd-power/main.c                   | 46 ++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index d7e1dc5813..b10ebe662e 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -455,3 +457,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD Power Management Mode
+-------------------------
+
+The PMD power management mode support for ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` performs simple L3 forwarding and enables the power saving scheme on specific
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index a48d75f68f..aafa415f0b 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt  mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,17 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(lcore_id,
+						     portid, queueid,
+						     RTE_POWER_MGMT_TYPE_SCALE);
+
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+					"rte_power_pmd_mgmt enable: err=%d, "
+					"port=%d\n", ret, portid);
+
+			}
 		}
 	}
 
@@ -2790,6 +2817,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2816,6 +2846,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-27 11:15                   ` Liang, Ma
@ 2020-10-27 15:52                     ` Thomas Monjalon
  2020-10-27 17:43                       ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-27 15:52 UTC (permalink / raw)
  To: Liang, Ma
  Cc: dev, anatoly.burakov, viktorin, qi.z.zhang, ruifeng.wang,
	beilei.xing, jia.guo, qiming.yang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, drc, Andrew Rybchenko, ferruh.yigit,
	jerinj, hemant.agrawal, viacheslavo, matan, ajit.khaparde,
	rahul.lakkireddy, johndale, xavier.huwei, shahafs, sthemmin,
	g.singh, rmody, maxime.coquelin, david.marchand

27/10/2020 12:15, Liang, Ma:
> > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > +/**
> > > + * Retrieve the wake up address for the receive queue.
> > 
> > I guess how this function should be used,
> > but a bit more explanations would not hurt here.
> agree
> > > + *
> > > + * @param port_id
> > > + *   The port identifier of the Ethernet device.
> > > + * @param queue_id
> > > + *   The Rx queue on the Ethernet device for which information will be
> > > + *   retrieved.
> > > + * @param wake_addr
> > > + *   The pointer to the address which will be monitored.
> > 
> > This function does not make the address monitored, right?
> This function only get the target wakeup address. that does not monitor this address.
> > 
> > > + * @param expected
> > > + *   The pointer to value to be expected when descriptor is set.
> > 
> > Not sure we should restrict it to a "descriptor".
>   actully that is not limited to a descriptor, any writeback content should work.
> > 
> > Expecting a value or some bits looks too much restrictive.
> > I understand it probably fits well for Intel NICs,
> > but in the general case, we can imagine that any change
> > in a byte array could be a wake up signal.
> 
> this parameter doesn not limited user how to use it.
> In fact, current design can support any bits change within 64 bits content.

How can the driver specify that any value change should be monitored?
I understand that it is only a value/mask pair,
it does not give room for "any value".




* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 0/9] " Liang Ma
@ 2020-10-27 16:02                   ` Thomas Monjalon
  2020-10-28 13:35                     ` Liang, Ma
  2020-10-27 20:53                   ` Ajit Khaparde
  2020-10-29 17:42                   ` Thomas Monjalon
  2 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-27 16:02 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, ruifeng.wang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra,
	johndale, xavier.huwei, xuanziyang2, matan, yongwang

Please be more patient.
I asked some questions on v9 (ethdev API is not generic enough),
and you send a v10 in the same minute you reply,
without making sure I agree with your answers.





* Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-27 15:52                     ` Thomas Monjalon
@ 2020-10-27 17:43                       ` Ananyev, Konstantin
  2020-10-27 18:30                         ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-27 17:43 UTC (permalink / raw)
  To: Thomas Monjalon, Ma, Liang J
  Cc: dev, Burakov, Anatoly, viktorin, Zhang, Qi Z, ruifeng.wang, Xing,
	Beilei, Guo, Jia, Yang, Qiming, Wang, Haiyue, Richardson,  Bruce,
	Hunt, David, jerinjacobk, nhorman, McDaniel, Timothy, Eads, Gage,
	drc, Andrew Rybchenko, Yigit, Ferruh, jerinj, hemant.agrawal,
	viacheslavo, matan, ajit.khaparde, rahul.lakkireddy, johndale,
	xavier.huwei, shahafs, sthemmin, g.singh, rmody, maxime.coquelin,
	david.marchand



> 27/10/2020 12:15, Liang, Ma:
> > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > +/**
> > > > + * Retrieve the wake up address for the receive queue.
> > >
> > > I guess how this function should be used,
> > > but a bit more explanations would not hurt here.
> > agree
> > > > + *
> > > > + * @param port_id
> > > > + *   The port identifier of the Ethernet device.
> > > > + * @param queue_id
> > > > + *   The Rx queue on the Ethernet device for which information will be
> > > > + *   retrieved.
> > > > + * @param wake_addr
> > > > + *   The pointer to the address which will be monitored.
> > >
> > > This function does not make the address monitored, right?
> > This function only get the target wakeup address. that does not monitor this address.
> > >
> > > > + * @param expected
> > > > + *   The pointer to value to be expected when descriptor is set.
> > >
> > > Not sure we should restrict it to a "descriptor".
> >   actully that is not limited to a descriptor, any writeback content should work.
> > >
> > > Expecting a value or some bits looks too much restrictive.
> > > I understand it probably fits well for Intel NICs,
> > > but in the general case, we can imagine that any change
> > > in a byte array could be a wake up signal.
> >
> > this parameter doesn not limited user how to use it.
> > In fact, current design can support any bits change within 64 bits content.
> 
> How the driver can specify that any value change should be monitored?
> I understand that it is only a value/mask pair,
> it does not give room for "any value".

As far as I can read the code, value=0, mask=0 will provide you with 'any value'.
Though it would mean that rte_power_monitor() will *always* go to sleep,
so I am not sure there is any practical use for such a case.
Konstantin




* Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-27 17:43                       ` Ananyev, Konstantin
@ 2020-10-27 18:30                         ` Thomas Monjalon
  2020-10-27 23:29                           ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-27 18:30 UTC (permalink / raw)
  To: Ma, Liang J, Ananyev, Konstantin
  Cc: dev, Burakov, Anatoly, viktorin, Zhang, Qi Z, ruifeng.wang, Xing,
	Beilei, Guo, Jia, Yang, Qiming, Wang, Haiyue, Richardson, Bruce,
	Hunt, David, jerinjacobk, nhorman, McDaniel, Timothy, Eads, Gage,
	drc, Andrew Rybchenko, Yigit, Ferruh, jerinj, hemant.agrawal,
	viacheslavo, matan, ajit.khaparde, rahul.lakkireddy, johndale,
	xavier.huwei, shahafs, sthemmin, g.singh, rmody, maxime.coquelin,
	david.marchand

27/10/2020 18:43, Ananyev, Konstantin:
> > 27/10/2020 12:15, Liang, Ma:
> > > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > > +/**
> > > > > + * Retrieve the wake up address for the receive queue.
> > > >
> > > > I guess how this function should be used,
> > > > but a bit more explanations would not hurt here.
> > > agree
> > > > > + *
> > > > > + * @param port_id
> > > > > + *   The port identifier of the Ethernet device.
> > > > > + * @param queue_id
> > > > > + *   The Rx queue on the Ethernet device for which information will be
> > > > > + *   retrieved.
> > > > > + * @param wake_addr
> > > > > + *   The pointer to the address which will be monitored.
> > > >
> > > > This function does not make the address monitored, right?
> > > This function only get the target wakeup address. that does not monitor this address.
> > > >
> > > > > + * @param expected
> > > > > + *   The pointer to value to be expected when descriptor is set.
> > > >
> > > > Not sure we should restrict it to a "descriptor".
> > >   actully that is not limited to a descriptor, any writeback content should work.
> > > >
> > > > Expecting a value or some bits looks too much restrictive.
> > > > I understand it probably fits well for Intel NICs,
> > > > but in the general case, we can imagine that any change
> > > > in a byte array could be a wake up signal.
> > >
> > > this parameter doesn not limited user how to use it.
> > > In fact, current design can support any bits change within 64 bits content.
> > 
> > How the driver can specify that any value change should be monitored?
> > I understand that it is only a value/mask pair,
> > it does not give room for "any value".
> 
> As I can read the code, value=0, mask=0 will provide you with 'any value'.
> Though it would mean that rte_power_monitor() will *always* go into sleep,
> so not sure what will be there any practical usage for such case. 

I think what is missing is a way to wake up when the value
of a byte array changes, without specifying any value.




* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 0/9] " Liang Ma
  2020-10-27 16:02                   ` Thomas Monjalon
@ 2020-10-27 20:53                   ` Ajit Khaparde
  2020-10-28 12:13                     ` Liang, Ma
  2020-10-29 17:42                   ` Thomas Monjalon
  2 siblings, 1 reply; 421+ messages in thread
From: Ajit Khaparde @ 2020-10-27 20:53 UTC (permalink / raw)
  To: Liang Ma
  Cc: dpdk-dev, Ruifeng Wang, Haiyue Wang, Bruce Richardson, Ananyev,
	Konstantin, david.hunt, Jerin Jacob, Neil Horman,
	Thomas Monjalon, timothy.mcdaniel, gage.eads, Marcin Wojtas,
	Guy Tzalik, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, Matan Azrad, Yong Wang

On Tue, Oct 27, 2020 at 7:59 AM Liang Ma <liang.j.ma@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers
> to cause the CPU to enter a power-optimized state while
> waiting for packets to arrive, along with a set of
> generic intrinsics that facilitate that. This is achieved
> through cooperation with the NIC driver that will allow
> us to know address of wake up event, and wait for writes
> on it.
Is the wake event the same as ring status or interrupt status register?

So in a way the PMD is passing the address of the next ring descriptor?
So that instead of the PMD polling it, the application peeks at it and
when ready asks the PMD to actually process the packet(s)?

>
> On IA, this is achieved through using UMONITOR/UMWAIT
> instructions. They are used in their raw opcode form
> because there is no widespread compiler support for
> them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen
> to implement similar instructions.
>
> To achieve power savings, there is a very simple mechanism
> used: we're counting empty polls, and if a certain threshold
> is reached, we get the address of next RX ring descriptor
> from the NIC driver, arm the monitoring hardware, and
> enter a power-optimized state. We will then wake up when
> either a timeout happens, or a write happens (or generally
> whenever CPU feels like waking up - this is platform-
> specific), and proceed as normal. The empty poll counter is
> reset whenever we actually get packets, so we only go to
> sleep when we know nothing is going on. The mechanism is
> generic which can be used for any write back descriptor.
>
> Why are we putting it into ethdev as opposed to leaving
> this up to the application? Our customers specifically
> requested a way to do it wit minimal changes to the
> application code. The current approach allows to just
> flip a switch and automatically have power savings.
The application still has to know address of wake up event. Right?
And then it will need the logic to count empty polls and the threshold?
This will be done by application or something else?

>
> - Only 1:1 core to queue mapping is supported,
>   meaning that each lcore must at most handle RX on a
>   single queue
> - Support 3 type policies. UMWAIT/PAUSE/Frequency_Scale
> - Power management is enabled per-queue
> - The API doesn't extend to other device types
>
> Liang Ma (9):
>   eal: add new x86 cpuid support for WAITPKG
>   eal: add power management intrinsics
>   eal: add intrinsics support check infrastructure
>   ethdev: add simple power management API
>   power: add PMD power management API and callback
>   net/ixgbe: implement power management API
>   net/i40e: implement power management API
>   net/ice: implement power management API
>   examples/l3fwd-power: enable PMD power mgmt
>
>  doc/guides/prog_guide/power_man.rst           |  48 +++
>  doc/guides/rel_notes/release_20_11.rst        |  15 +
>  .../sample_app_ug/l3_forward_power_man.rst    |  13 +
>  drivers/net/i40e/i40e_ethdev.c                |   1 +
>  drivers/net/i40e/i40e_rxtx.c                  |  26 ++
>  drivers/net/i40e/i40e_rxtx.h                  |   2 +
>  drivers/net/ice/ice_ethdev.c                  |   1 +
>  drivers/net/ice/ice_rxtx.c                    |  26 ++
>  drivers/net/ice/ice_rxtx.h                    |   2 +
>  drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
>  drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
>  examples/l3fwd-power/main.c                   |  46 ++-
>  lib/librte_eal/arm/include/meson.build        |   1 +
>  .../arm/include/rte_power_intrinsics.h        |  60 ++++
>  lib/librte_eal/arm/rte_cpuflags.c             |   6 +
>  lib/librte_eal/include/generic/rte_cpuflags.h |  26 ++
>  .../include/generic/rte_power_intrinsics.h    | 123 +++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/ppc/include/meson.build        |   1 +
>  .../ppc/include/rte_power_intrinsics.h        |  60 ++++
>  lib/librte_eal/ppc/rte_cpuflags.c             |   7 +
>  lib/librte_eal/version.map                    |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 135 ++++++++
>  lib/librte_eal/x86/rte_cpuflags.c             |  14 +
>  lib/librte_ethdev/rte_ethdev.c                |  23 ++
>  lib/librte_ethdev/rte_ethdev.h                |  33 ++
>  lib/librte_ethdev/rte_ethdev_driver.h         |  28 ++
>  lib/librte_ethdev/version.map                 |   1 +
>  lib/librte_power/meson.build                  |   5 +-
>  lib/librte_power/rte_power_pmd_mgmt.c         | 320 ++++++++++++++++++
>  lib/librte_power/rte_power_pmd_mgmt.h         |  92 +++++
>  lib/librte_power/version.map                  |   4 +
>  35 files changed, 1148 insertions(+), 3 deletions(-)
>  create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
>
> --
> 2.17.1
>


* Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-27 18:30                         ` Thomas Monjalon
@ 2020-10-27 23:29                           ` Ananyev, Konstantin
  2020-10-28  3:24                             ` Ajit Khaparde
  2020-10-28 12:24                             ` Liang, Ma
  0 siblings, 2 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-27 23:29 UTC (permalink / raw)
  To: Thomas Monjalon, Ma, Liang J
  Cc: dev, Burakov, Anatoly, viktorin, Zhang, Qi Z, ruifeng.wang, Xing,
	Beilei, Guo, Jia, Yang, Qiming, Wang, Haiyue, Richardson,  Bruce,
	Hunt, David, jerinjacobk, nhorman, McDaniel, Timothy, Eads, Gage,
	drc, Andrew Rybchenko, Yigit, Ferruh, jerinj, hemant.agrawal,
	viacheslavo, matan, ajit.khaparde, rahul.lakkireddy, johndale,
	xavier.huwei, shahafs, sthemmin, g.singh, rmody, maxime.coquelin,
	david.marchand



> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Tuesday, October 27, 2020 6:31 PM
> To: Ma, Liang J <liang.j.ma@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; viktorin@rehivetech.com; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ruifeng.wang@arm.com; Xing, Beilei <beilei.xing@intel.com>; Guo, Jia <jia.guo@intel.com>; Yang, Qiming <qiming.yang@intel.com>;
> Wang, Haiyue <haiyue.wang@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Hunt, David <david.hunt@intel.com>;
> jerinjacobk@gmail.com; nhorman@tuxdriver.com; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> <gage.eads@intel.com>; drc@linux.vnet.ibm.com; Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; jerinj@marvell.com; hemant.agrawal@nxp.com; viacheslavo@nvidia.com; matan@nvidia.com;
> ajit.khaparde@broadcom.com; rahul.lakkireddy@chelsio.com; johndale@cisco.com; xavier.huwei@huawei.com; shahafs@nvidia.com;
> sthemmin@microsoft.com; g.singh@nxp.com; rmody@marvell.com; maxime.coquelin@redhat.com; david.marchand@redhat.com
> Subject: Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
> 
> 27/10/2020 18:43, Ananyev, Konstantin:
> > > 27/10/2020 12:15, Liang, Ma:
> > > > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > > > +/**
> > > > > > + * Retrieve the wake up address for the receive queue.
> > > > >
> > > > > I guess how this function should be used,
> > > > > but a bit more explanations would not hurt here.
> > > > agree
> > > > > > + *
> > > > > > + * @param port_id
> > > > > > + *   The port identifier of the Ethernet device.
> > > > > > + * @param queue_id
> > > > > > + *   The Rx queue on the Ethernet device for which information will be
> > > > > > + *   retrieved.
> > > > > > + * @param wake_addr
> > > > > > + *   The pointer to the address which will be monitored.
> > > > >
> > > > > This function does not make the address monitored, right?
> > > > This function only get the target wakeup address. that does not monitor this address.
> > > > >
> > > > > > + * @param expected
> > > > > > + *   The pointer to value to be expected when descriptor is set.
> > > > >
> > > > > Not sure we should restrict it to a "descriptor".
> > > >   actully that is not limited to a descriptor, any writeback content should work.
> > > > >
> > > > > Expecting a value or some bits looks too much restrictive.
> > > > > I understand it probably fits well for Intel NICs,
> > > > > but in the general case, we can imagine that any change
> > > > > in a byte array could be a wake up signal.
> > > >
> > > > this parameter doesn not limited user how to use it.
> > > > In fact, current design can support any bits change within 64 bits content.
> > >
> > > How the driver can specify that any value change should be monitored?
> > > I understand that it is only a value/mask pair,
> > > it does not give room for "any value".
> >
> > As I can read the code, value=0, mask=0 will provide you with 'any value'.
> > Though it would mean that rte_power_monitor() will *always* go into sleep,
> > so not sure what will be there any practical usage for such case.
> 
> I think what is missing is to allow waking up when the value
> of a byte array is changing, without specifiying any value.


I think it will always wake up on any write to wait_addr.
What you control with the value/mask pair is when we should go to sleep.
In other words:
ret = rte_eth_get_wake_addr(port, queue, &wait_addr, &value, &mask, ....);

mask==0: always go to sleep, wake up at any store to wait_addr.
mask!=0: go to sleep only if (*wait_addr & mask) == value, wake up at any store to wait_addr.

Liang, Anatoly - feel free to correct me here, if I missed something.
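
To make the two cases concrete, here is a minimal C sketch of the go-to-sleep
decision as described above. The helper name should_sleep() and its signature
are made up for illustration and are not part of the patch. The sense of the
comparison is taken from the description above and has not been checked
against rte_power_monitor(), so treat it as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: decide whether to enter
 * the power-optimized state, following the rule described above.
 * The comparison sense (== value) is an assumption taken from that
 * description.
 */
static bool
should_sleep(uint64_t current, uint64_t value, uint64_t mask)
{
	/* mask == 0: no condition is checked, always go to sleep */
	if (mask == 0)
		return true;

	/* mask != 0: sleep only if the masked content equals value */
	return (current & mask) == value;
}
```

With this helper, should_sleep(0x5, 0, 0) is true whatever the content is,
should_sleep(0x5, 0x1, 0x1) is true, and should_sleep(0x4, 0x1, 0x1) is false.
In every case the wake-up itself is triggered by any store to the monitored
address, not by the value/mask pair.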



* Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-27 23:29                           ` Ananyev, Konstantin
@ 2020-10-28  3:24                             ` Ajit Khaparde
  2020-10-28 12:24                             ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Ajit Khaparde @ 2020-10-28  3:24 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Thomas Monjalon, Ma, Liang J, dev, Burakov, Anatoly, viktorin,
	Zhang, Qi Z, ruifeng.wang, Xing, Beilei, Guo, Jia, Yang, Qiming,
	Wang, Haiyue, Richardson, Bruce, Hunt, David, jerinjacobk,
	nhorman, McDaniel, Timothy, Eads, Gage, drc, Andrew Rybchenko,
	Yigit, Ferruh, jerinj, hemant.agrawal, viacheslavo, matan,
	rahul.lakkireddy, johndale, xavier.huwei, shahafs, sthemmin,
	g.singh, rmody, maxime.coquelin, david.marchand

On Tue, Oct 27, 2020 at 4:29 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Thomas Monjalon <thomas@monjalon.net>
> > Sent: Tuesday, October 27, 2020 6:31 PM
> > To: Ma, Liang J <liang.j.ma@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; viktorin@rehivetech.com; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > ruifeng.wang@arm.com; Xing, Beilei <beilei.xing@intel.com>; Guo, Jia <jia.guo@intel.com>; Yang, Qiming <qiming.yang@intel.com>;
> > Wang, Haiyue <haiyue.wang@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Hunt, David <david.hunt@intel.com>;
> > jerinjacobk@gmail.com; nhorman@tuxdriver.com; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> > <gage.eads@intel.com>; drc@linux.vnet.ibm.com; Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Yigit, Ferruh
> > <ferruh.yigit@intel.com>; jerinj@marvell.com; hemant.agrawal@nxp.com; viacheslavo@nvidia.com; matan@nvidia.com;
> > ajit.khaparde@broadcom.com; rahul.lakkireddy@chelsio.com; johndale@cisco.com; xavier.huwei@huawei.com; shahafs@nvidia.com;
> > sthemmin@microsoft.com; g.singh@nxp.com; rmody@marvell.com; maxime.coquelin@redhat.com; david.marchand@redhat.com
> > Subject: Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
> >
> > 27/10/2020 18:43, Ananyev, Konstantin:
> > > > 27/10/2020 12:15, Liang, Ma:
> > > > > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > > > > +/**
> > > > > > > + * Retrieve the wake up address for the receive queue.
> > > > > >
> > > > > > I guess how this function should be used,
> > > > > > but a bit more explanations would not hurt here.
> > > > > agree
> > > > > > > + *
> > > > > > > + * @param port_id
> > > > > > > + *   The port identifier of the Ethernet device.
> > > > > > > + * @param queue_id
> > > > > > > + *   The Rx queue on the Ethernet device for which information will be
> > > > > > > + *   retrieved.
> > > > > > > + * @param wake_addr
> > > > > > > + *   The pointer to the address which will be monitored.
> > > > > >
> > > > > > This function does not make the address monitored, right?
> > > > > This function only get the target wakeup address. that does not monitor this address.
> > > > > >
> > > > > > > + * @param expected
> > > > > > > + *   The pointer to value to be expected when descriptor is set.
> > > > > >
> > > > > > Not sure we should restrict it to a "descriptor".
> > > > >   actully that is not limited to a descriptor, any writeback content should work.
> > > > > >
> > > > > > Expecting a value or some bits looks too much restrictive.
> > > > > > I understand it probably fits well for Intel NICs,
> > > > > > but in the general case, we can imagine that any change
> > > > > > in a byte array could be a wake up signal.
> > > > >
> > > > > this parameter doesn not limited user how to use it.
> > > > > In fact, current design can support any bits change within 64 bits content.
> > > >
> > > > How the driver can specify that any value change should be monitored?
> > > > I understand that it is only a value/mask pair,
> > > > it does not give room for "any value".
> > >
> > > As I can read the code, value=0, mask=0 will provide you with 'any value'.
> > > Though it would mean that rte_power_monitor() will *always* go into sleep,
> > > so not sure what will be there any practical usage for such case.
> >
> > I think what is missing is to allow waking up when the value
> > of a byte array is changing, without specifiying any value.
>
>
> I think it will always wakeup on any write to wait_addr.
> What you control with value/mask pair - when we should go to sleep.
> In other words:
> ret = rte_eth_get_wake_addr(port, queue, &wait_addr, &value, &mask, ....);
>
> mask==0: always go to sleep, wakeup at any store to wait_addr.
> mask!=0: go to sleep only if (*wait_addr & mask) == value, wakeup at any store to wait_addr.
I did not get this impression on first reading, until you put
it this way.
The comment "if the masked value is already matching, abort" stumped me.

>
> Liang, Anatoly - feel free to correct me here, if I missed something.
>


* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-27 20:53                   ` Ajit Khaparde
@ 2020-10-28 12:13                     ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-10-28 12:13 UTC (permalink / raw)
  To: Ajit Khaparde
  Cc: dpdk-dev, Ruifeng Wang, Haiyue Wang, Bruce Richardson, Ananyev,
	Konstantin, david.hunt, Jerin Jacob, Neil Horman,
	Thomas Monjalon, timothy.mcdaniel, gage.eads, Marcin Wojtas,
	Guy Tzalik, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, Matan Azrad, Yong Wang

On 27 Oct 13:53, Ajit Khaparde wrote:
> On Tue, Oct 27, 2020 at 7:59 AM Liang Ma <liang.j.ma@intel.com> wrote:
> >
> > This patchset proposes a simple API for Ethernet drivers
> > to cause the CPU to enter a power-optimized state while
> > waiting for packets to arrive, along with a set of
> > generic intrinsics that facilitate that. This is achieved
> > through cooperation with the NIC driver that will allow
> > us to know address of wake up event, and wait for writes
> > on it.
> Is the wake event the same as ring status or interrupt status register?
  The wake event is any change to the monitored address range.
  It can be a ring descriptor (e.g. Done bits) but is not limited to that.
> 
> So in a way the PMD is passing the address of the next ring descriptor?
> So that instead of the PMD polling it, the application peeks at it and
> when ready asks the PMD to actually process the packet(s)?
  No. After the PMD passes the address, the processor monitors and waits on
  the address range (i.e. moves to a sub-state). Any change to the address
  range caused by another source (another processor, NIC, etc.) wakes up the
  processor, which then starts polling again.
> 
> >
> > On IA, this is achieved through using UMONITOR/UMWAIT
> > instructions. They are used in their raw opcode form
> > because there is no widespread compiler support for
> > them yet. Still, the API is made generic enough to
> > hopefully support other architectures, if they happen
> > to implement similar instructions.
> >
> > To achieve power savings, there is a very simple mechanism
> > used: we're counting empty polls, and if a certain threshold
> > is reached, we get the address of next RX ring descriptor
> > from the NIC driver, arm the monitoring hardware, and
> > enter a power-optimized state. We will then wake up when
> > either a timeout happens, or a write happens (or generally
> > whenever CPU feels like waking up - this is platform-
> > specific), and proceed as normal. The empty poll counter is
> > reset whenever we actually get packets, so we only go to
> > sleep when we know nothing is going on. The mechanism is
> > generic which can be used for any write back descriptor.
> >
> > Why are we putting it into ethdev as opposed to leaving
> > this up to the application? Our customers specifically
> > requested a way to do it wit minimal changes to the
> > application code. The current approach allows to just
> > flip a switch and automatically have power savings.
> The application still has to know address of wake up event. Right?
> And then it will need the logic to count empty polls and the threshold?
> This will be done by application or something else?
  The empty poll counting is done by the power library callback function,
  which is included in patch 5. The application does not need to change
  anything besides the initialization code (calling the enable/disable API).
  Please refer to patch 9, which demonstrates how to use it with l3fwd-power.
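
  A rough sketch of that counting logic, for illustration only. The names
  (queue_state, note_rx_burst, EMPTYPOLL_MAX) and the threshold value are
  assumptions made for this example and are not the code in patch 5. Once
  the helper returns true, the real callback would fetch the wake address
  from the driver and arm the monitor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold, chosen only for this example. */
#define EMPTYPOLL_MAX 512

/* Hypothetical per-queue state kept by the Rx callback. */
struct queue_state {
	uint64_t empty_polls;
};

/*
 * Hypothetical helper: record the result of one Rx burst.
 * Returns true when the number of consecutive empty polls has reached
 * the threshold, i.e. when the caller may enter the power-optimized state.
 */
static bool
note_rx_burst(struct queue_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: reset the counter and keep polling */
		st->empty_polls = 0;
		return false;
	}

	st->empty_polls++;
	return st->empty_polls >= EMPTYPOLL_MAX;
}
```

  The counter only grows across consecutive empty polls, so a single
  received packet is enough to keep the lcore in its normal polling loop.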
> 
> >
> > - Only 1:1 core to queue mapping is supported,
> >   meaning that each lcore must at most handle RX on a
> >   single queue
> > - Support 3 type policies. UMWAIT/PAUSE/Frequency_Scale
> > - Power management is enabled per-queue
> > - The API doesn't extend to other device types
> >
> > Liang Ma (9):
> >   eal: add new x86 cpuid support for WAITPKG
> >   eal: add power management intrinsics
> >   eal: add intrinsics support check infrastructure
> >   ethdev: add simple power management API
> >   power: add PMD power management API and callback
> >   net/ixgbe: implement power management API
> >   net/i40e: implement power management API
> >   net/ice: implement power management API
> >   examples/l3fwd-power: enable PMD power mgmt
> >
> >  doc/guides/prog_guide/power_man.rst           |  48 +++
> >  doc/guides/rel_notes/release_20_11.rst        |  15 +
> >  .../sample_app_ug/l3_forward_power_man.rst    |  13 +
> >  drivers/net/i40e/i40e_ethdev.c                |   1 +
> >  drivers/net/i40e/i40e_rxtx.c                  |  26 ++
> >  drivers/net/i40e/i40e_rxtx.h                  |   2 +
> >  drivers/net/ice/ice_ethdev.c                  |   1 +
> >  drivers/net/ice/ice_rxtx.c                    |  26 ++
> >  drivers/net/ice/ice_rxtx.h                    |   2 +
> >  drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
> >  drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
> >  drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
> >  examples/l3fwd-power/main.c                   |  46 ++-
> >  lib/librte_eal/arm/include/meson.build        |   1 +
> >  .../arm/include/rte_power_intrinsics.h        |  60 ++++
> >  lib/librte_eal/arm/rte_cpuflags.c             |   6 +
> >  lib/librte_eal/include/generic/rte_cpuflags.h |  26 ++
> >  .../include/generic/rte_power_intrinsics.h    | 123 +++++++
> >  lib/librte_eal/include/meson.build            |   1 +
> >  lib/librte_eal/ppc/include/meson.build        |   1 +
> >  .../ppc/include/rte_power_intrinsics.h        |  60 ++++
> >  lib/librte_eal/ppc/rte_cpuflags.c             |   7 +
> >  lib/librte_eal/version.map                    |   1 +
> >  lib/librte_eal/x86/include/meson.build        |   1 +
> >  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
> >  .../x86/include/rte_power_intrinsics.h        | 135 ++++++++
> >  lib/librte_eal/x86/rte_cpuflags.c             |  14 +
> >  lib/librte_ethdev/rte_ethdev.c                |  23 ++
> >  lib/librte_ethdev/rte_ethdev.h                |  33 ++
> >  lib/librte_ethdev/rte_ethdev_driver.h         |  28 ++
> >  lib/librte_ethdev/version.map                 |   1 +
> >  lib/librte_power/meson.build                  |   5 +-
> >  lib/librte_power/rte_power_pmd_mgmt.c         | 320 ++++++++++++++++++
> >  lib/librte_power/rte_power_pmd_mgmt.h         |  92 +++++
> >  lib/librte_power/version.map                  |   4 +
> >  35 files changed, 1148 insertions(+), 3 deletions(-)
> >  create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
> >  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
> >  create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
> >  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> >  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
> >  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> >
> > --
> > 2.17.1
> >



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
  2020-10-27 23:29                           ` Ananyev, Konstantin
  2020-10-28  3:24                             ` Ajit Khaparde
@ 2020-10-28 12:24                             ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-10-28 12:24 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Thomas Monjalon, dev, Burakov, Anatoly, viktorin, Zhang, Qi Z,
	ruifeng.wang, Xing, Beilei, Guo, Jia, Yang, Qiming, Wang, Haiyue,
	Richardson, Bruce, Hunt, David, jerinjacobk, nhorman, McDaniel,
	Timothy, Eads, Gage, drc, Andrew Rybchenko, Yigit, Ferruh,
	jerinj, hemant.agrawal, viacheslavo, matan, ajit.khaparde,
	rahul.lakkireddy, johndale, xavier.huwei, shahafs, sthemmin,
	g.singh, rmody, maxime.coquelin, david.marchand

On 27 Oct 16:29, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Thomas Monjalon <thomas@monjalon.net>
> > Sent: Tuesday, October 27, 2020 6:31 PM
> > To: Ma, Liang J <liang.j.ma@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: dev@dpdk.org; Burakov, Anatoly <anatoly.burakov@intel.com>; viktorin@rehivetech.com; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > ruifeng.wang@arm.com; Xing, Beilei <beilei.xing@intel.com>; Guo, Jia <jia.guo@intel.com>; Yang, Qiming <qiming.yang@intel.com>;
> > Wang, Haiyue <haiyue.wang@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Hunt, David <david.hunt@intel.com>;
> > jerinjacobk@gmail.com; nhorman@tuxdriver.com; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> > <gage.eads@intel.com>; drc@linux.vnet.ibm.com; Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Yigit, Ferruh
> > <ferruh.yigit@intel.com>; jerinj@marvell.com; hemant.agrawal@nxp.com; viacheslavo@nvidia.com; matan@nvidia.com;
> > ajit.khaparde@broadcom.com; rahul.lakkireddy@chelsio.com; johndale@cisco.com; xavier.huwei@huawei.com; shahafs@nvidia.com;
> > sthemmin@microsoft.com; g.singh@nxp.com; rmody@marvell.com; maxime.coquelin@redhat.com; david.marchand@redhat.com
> > Subject: Re: [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API
> >
> > 27/10/2020 18:43, Ananyev, Konstantin:
> > > > 27/10/2020 12:15, Liang, Ma:
> > > > > > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > > > > > +/**
> > > > > > > + * Retrieve the wake up address for the receive queue.
> > > > > >
> > > > > > I guess how this function should be used,
> > > > > > but a bit more explanations would not hurt here.
> > > > > agree
> > > > > > > + *
> > > > > > > + * @param port_id
> > > > > > > + *   The port identifier of the Ethernet device.
> > > > > > > + * @param queue_id
> > > > > > > + *   The Rx queue on the Ethernet device for which information will be
> > > > > > > + *   retrieved.
> > > > > > > + * @param wake_addr
> > > > > > > + *   The pointer to the address which will be monitored.
> > > > > >
> > > > > > This function does not make the address monitored, right?
> > > > > This function only get the target wakeup address. that does not monitor this address.
> > > > > >
> > > > > > > + * @param expected
> > > > > > > + *   The pointer to value to be expected when descriptor is set.
> > > > > >
> > > > > > Not sure we should restrict it to a "descriptor".
> > > > >   actully that is not limited to a descriptor, any writeback content should work.
> > > > > >
> > > > > > Expecting a value or some bits looks too much restrictive.
> > > > > > I understand it probably fits well for Intel NICs,
> > > > > > but in the general case, we can imagine that any change
> > > > > > in a byte array could be a wake up signal.
> > > > >
> > > > > this parameter doesn not limited user how to use it.
> > > > > In fact, current design can support any bits change within 64 bits content.
> > > >
> > > > How the driver can specify that any value change should be monitored?
> > > > I understand that it is only a value/mask pair,
> > > > it does not give room for "any value".
> > >
> > > As I can read the code, value=0, mask=0 will provide you with 'any value'.
> > > Though it would mean that rte_power_monitor() will *always* go into sleep,
> > > so not sure what will be there any practical usage for such case.
> >
> > I think what is missing is to allow waking up when the value
> > of a byte array is changing, without specifiying any value.
> 
> 
> I think it will always wakeup on any write to wait_addr.
> What you control with value/mask pair - when we should go to sleep.
> In other words:
> ret = rte_eth_get_wake_addr(port, queue, &wait_addr, &value, &mask, ....);
> 
> mask==0: always go to sleep, wakeup at any store to wait_addr.
> mask!=0: go to sleep only if (*wait_addr & mask) == value, wakeup at any store to wait_addr.
> 
> Liang, Anatoly - feel free to correct me here, if I missed something.
  The mask and expected value are used to prevent a race condition with the NIC.
  There is an interval between the code getting the target address and issuing the
  umwait. The mask therefore selects the bits to check: if (current value & mask)
  already equals the expected value just before the umwait is issued, the address
  is no longer useful for umwait because the DMA write has already happened.
  In that case the sleep path is aborted and we go back to the normal logic.
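
  As a rough sketch of that check (following the description just above, not
  copied from the patch): the helper name safe_to_sleep is hypothetical, and
  a 64-bit monitored word is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decide whether it is still safe to enter the wait state.
 * Returns false when the masked value at addr already equals the
 * expected value, meaning the write we wanted to wait for has
 * already landed and sleeping would only add latency.
 * A mask of 0 disables the check, so the caller always sleeps.
 */
static bool
safe_to_sleep(const volatile uint64_t *addr, uint64_t expected, uint64_t mask)
{
	assert(addr != NULL);

	if (mask == 0)
		return true;
	return (*addr & mask) != expected;
}
```

  The caller would run this check just before issuing the wait and skip the
  wait whenever it returns false.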

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-27 16:02                   ` Thomas Monjalon
@ 2020-10-28 13:35                     ` Liang, Ma
  2020-10-28 13:49                       ` Jerin Jacob
  0 siblings, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-10-28 13:35 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, ruifeng.wang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra,
	johndale, xavier.huwei, xuanziyang2, matan, yongwang

Hi Thomas,
   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week, Jerin also said that a generic API was investigated but that a single solution wasn't feasible. This API is experimental and other vendor support can be added as needed. If there are any other open issues, please let me know.

Regards
Liang

On 27 Oct 17:02, Thomas Monjalon wrote:
> Please be more patient.
> I asked some questions on v9 (ethdev API is not generic enough),
> and you send a v10 in the same minute you reply,
> without making sure I agree with your answers.
> 
> 
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 13:35                     ` Liang, Ma
@ 2020-10-28 13:49                       ` Jerin Jacob
  2020-10-28 14:21                         ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-28 13:49 UTC (permalink / raw)
  To: Liang, Ma
  Cc: Thomas Monjalon, dpdk-dev, Ruifeng Wang (Arm Technology China),
	Haiyue Wang, Richardson, Bruce, Ananyev, Konstantin, David Hunt,
	Neil Horman, McDaniel, Timothy, Gage Eads, Marcin Wojtas,
	Guy Tzalik, Ajit Khaparde, Harman Kalra, John Daley,
	Wei Hu (Xavier, Ziyang Xuan, matan, Yong Wang

On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
>
> Hi Thomas,
>   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.

I think, from the architecture point of view, the specific
functionality of UMONITOR may not be possible to abstract.
But from the ethdev callback point of view, can it be abstracted in
such a way that packet notification is made available by the driver
through checking an interrupt status register, a ring descriptor
location, etc.? Could we use that callback as a notification mechanism
rather than defining the memory-based scheme that UMONITOR expects?
Or are there similar thoughts on abstraction?


> This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
>
> Regards
> Liang
>
> On 27 Oct 17:02, Thomas Monjalon wrote:
> > Please be more patient.
> > I asked some questions on v9 (ethdev API is not generic enough),
> > and you send a v10 in the same minute you reply,
> > without making sure I agree with your answers.
> >
> >
> >

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 13:49                       ` Jerin Jacob
@ 2020-10-28 14:21                         ` Thomas Monjalon
  2020-10-28 14:57                           ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-28 14:21 UTC (permalink / raw)
  To: Liang, Ma, Jerin Jacob
  Cc: dpdk-dev, Ruifeng Wang (Arm Technology China),
	Haiyue Wang, Richardson, Bruce, Ananyev, Konstantin, David Hunt,
	Neil Horman, McDaniel, Timothy, Gage Eads, Marcin Wojtas,
	Guy Tzalik, Ajit Khaparde, Harman Kalra, John Daley,
	Wei Hu (Xavier, Ziyang Xuan, matan, Yong Wang, david.marchand

28/10/2020 14:49, Jerin Jacob:
> On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> >
> > Hi Thomas,
> >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> 
> I think, From the architecture point of view, the specific
> functionally of UMONITOR may not be abstracted.
> But from the ethdev callback point of view, Can it be abstracted in
> such a way that packet notification available through
> checking interrupt status register or ring descriptor location, etc by
> the driver. Use that callback as a notification mechanism rather
> than defining a memory-based scheme that UMONITOR expects? or similar
> thoughts on abstraction.

I agree with Jerin.
The ethdev API is the blocking problem.
First problem: it is not well explained in doxygen.
Second problem: it is probably not generic enough (if we understand it well)

> > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?

Being experimental is not an excuse to throw something
which is not satisfying.




^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 14:21                         ` Thomas Monjalon
@ 2020-10-28 14:57                           ` Ananyev, Konstantin
  2020-10-28 15:14                             ` Jerin Jacob
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-28 14:57 UTC (permalink / raw)
  To: Thomas Monjalon, Ma, Liang J, Jerin Jacob
  Cc: dpdk-dev, Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand



> 28/10/2020 14:49, Jerin Jacob:
> > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > >
> > > Hi Thomas,
> > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> >
> > I think, From the architecture point of view, the specific
> > functionally of UMONITOR may not be abstracted.
> > But from the ethdev callback point of view, Can it be abstracted in
> > such a way that packet notification available through
> > checking interrupt status register or ring descriptor location, etc by
> > the driver. Use that callback as a notification mechanism rather
> > than defining a memory-based scheme that UMONITOR expects? or similar
> > thoughts on abstraction.

I think there is probably some sort of misunderstanding.
This API is not about providing async notification when the next packet arrives.
It is about putting the core to sleep until some event (or timeout) happens.
From my perspective, the closest analogy is cond_timedwait().
So we need the PMD to tell us the address of the condition variable
we should sleep on.

> I agree with Jerin.
> The ethdev API is the blocking problem.
> First problem: it is not well explained in doxygen.
> Second problem: it is probably not generic enough (if we understand it well)

It is an address to sleep(/wakeup) on, plus an expected value.
Honestly, I can't think of anything more generic than that.
If you guys have something particular in mind, please share.

> 
> > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> 
> Being experimental is not an excuse to throw something
> which is not satisfying.
> 
> 


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 14:57                           ` Ananyev, Konstantin
@ 2020-10-28 15:14                             ` Jerin Jacob
  2020-10-28 15:30                               ` Liang, Ma
  2020-10-28 15:33                               ` Ananyev, Konstantin
  0 siblings, 2 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-28 15:14 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On Wed, Oct 28, 2020 at 8:27 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
>
> > 28/10/2020 14:49, Jerin Jacob:
> > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > >
> > > > Hi Thomas,
> > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > >
> > > I think, From the architecture point of view, the specific
> > > functionally of UMONITOR may not be abstracted.
> > > But from the ethdev callback point of view, Can it be abstracted in
> > > such a way that packet notification available through
> > > checking interrupt status register or ring descriptor location, etc by
> > > the driver. Use that callback as a notification mechanism rather
> > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > thoughts on abstraction.
>
> I think there is probably some sort of misunderstanding.
> This API is not about providing acync notification when next packet arrives.
> This is about to putting core to sleep till some event (or timeout) happens.
> From my perspective the closest analogy: cond_timedwait().
> So we need PMD to tell us what will be the address of the condition variable
> we should sleep on.
>
> > I agree with Jerin.
> > The ethdev API is the blocking problem.
> > First problem: it is not well explained in doxygen.
> > Second problem: it is probably not generic enough (if we understand it well)
>
> It is an address to sleep(/wakeup) on, plus expected value.
> Honestly, I can't think-up of anything even more generic then that.
> If you guys have something particular in mind - please share.

Current PMD callback:
typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
        uint64_t *expected, uint64_t *mask, uint8_t *data_sz);

Can we make it:
typedef void (*core_sleep_t)(void *rxq);

If we do such an abstraction and move the "polling on memory by HW/CPU"
into the driver using a helper function, then
I think it can be abstracted in some way across all PMDs.

Note: core_sleep_t can take some more arguments such as enumerated
policy if something more needs to be pushed to the driver.

Thoughts?
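
To show the difference between the two callback shapes side by side, here is
a small sketch with a toy driver. The two typedefs are the ones quoted above;
everything prefixed with toy_ or TOY_ is hypothetical and only exists for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Current shape: the driver only describes what the caller should watch. */
typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);

/* Proposed shape: the driver performs the wait itself. */
typedef void (*core_sleep_t)(void *rxq);

/* Hypothetical queue with one 64-bit status word written by hardware. */
struct toy_rxq {
	volatile uint64_t status;
	unsigned int sleep_calls;
};

#define TOY_DONE_BIT UINT64_C(0x1)

static int
toy_get_wake_addr(void *rxq, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
{
	struct toy_rxq *q = rxq;

	assert(q != NULL);
	*tail_desc_addr = &q->status;
	*expected = TOY_DONE_BIT;
	*mask = TOY_DONE_BIT;
	*data_sz = (uint8_t)sizeof(q->status);
	return 0;
}

static void
toy_core_sleep(void *rxq)
{
	struct toy_rxq *q = rxq;

	assert(q != NULL);
	/* A real driver would block here using whatever the hardware offers. */
	q->sleep_calls++;
}
```

With the first shape the generic layer owns the wait instruction; with the
second shape the wait is hidden inside the driver.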

>
> >
> > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> >
> > Being experimental is not an excuse to throw something
> > which is not satisfying.
> >
> >
>

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:14                             ` Jerin Jacob
@ 2020-10-28 15:30                               ` Liang, Ma
  2020-10-28 15:36                                 ` Jerin Jacob
  2020-10-28 15:33                               ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-10-28 15:30 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On 28 Oct 20:44, Jerin Jacob wrote:
> On Wed, Oct 28, 2020 at 8:27 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> >
> > > 28/10/2020 14:49, Jerin Jacob:
> > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > >
> > > > > Hi Thomas,
> > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > >
> > > > I think, From the architecture point of view, the specific
> > > > functionally of UMONITOR may not be abstracted.
> > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > such a way that packet notification available through
> > > > checking interrupt status register or ring descriptor location, etc by
> > > > the driver. Use that callback as a notification mechanism rather
> > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > thoughts on abstraction.
> >
> > I think there is probably some sort of misunderstanding.
> > This API is not about providing acync notification when next packet arrives.
> > This is about to putting core to sleep till some event (or timeout) happens.
> > From my perspective the closest analogy: cond_timedwait().
> > So we need PMD to tell us what will be the address of the condition variable
> > we should sleep on.
> >
> > > I agree with Jerin.
> > > The ethdev API is the blocking problem.
> > > First problem: it is not well explained in doxygen.
> > > Second problem: it is probably not generic enough (if we understand it well)
> >
> > It is an address to sleep(/wakeup) on, plus expected value.
> > Honestly, I can't think-up of anything even more generic then that.
> > If you guys have something particular in mind - please share.
> 
> Current PMD callback:
> typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> *data_sz);
> 
> Can we make it as
> typedef void (*core_sleep_t)(void *rxq)
How about void (*eth_core_sleep_helper_t)(void *rxq, enum scheme, void *parameter)?
This way, the PMD can cast the parameter according to the scheme,
e.g. if the scheme is MEM_MONITOR, then cast to the umwait parameters.
However, this will introduce another problem:
we need to add a PMD query callback to figure out whether the PMD supports this scheme.
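
A rough sketch of how that could look, assuming a toy driver. Only the
MEM_MONITOR scheme name comes from the proposal above; the enum, the
parameter struct and the toy_ functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scheme identifiers. */
enum sleep_scheme {
	SLEEP_SCHEME_MEM_MONITOR,
	SLEEP_SCHEME_PAUSE,
};

/* Hypothetical parameter block for the memory-monitor scheme. */
struct mem_monitor_param {
	uint64_t timeout_tsc;
	unsigned int times_called;
};

/* Query callback: does this toy driver support the given scheme? */
static bool
toy_scheme_supported(enum sleep_scheme scheme)
{
	return scheme == SLEEP_SCHEME_MEM_MONITOR;
}

/* Sleep helper: casts the opaque parameter according to the scheme. */
static void
toy_core_sleep_helper(void *rxq, enum sleep_scheme scheme, void *parameter)
{
	(void)rxq; /* unused in this toy driver */

	switch (scheme) {
	case SLEEP_SCHEME_MEM_MONITOR: {
		struct mem_monitor_param *p = parameter;

		assert(p != NULL);
		p->times_called++; /* stand-in for the actual wait */
		break;
	}
	default:
		/* Unsupported: the caller should have queried first. */
		break;
	}
}
```

The void * cast is what makes the query callback necessary: nothing in the
signature tells the caller which schemes a given PMD understands.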
> 
> if we do such abstraction and "move the polling on memory by HW/CPU"
> to the driver using a helper function then
> I can think of abstracting in some way in all PMDs.
> 
> Note: core_sleep_t can take some more arguments such as enumerated
> policy if something more needs to be pushed to the driver.
> 
> Thoughts?
> 
> >
> > >
> > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > >
> > > Being experimental is not an excuse to throw something
> > > which is not satisfying.
> > >
> > >
> >

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:14                             ` Jerin Jacob
  2020-10-28 15:30                               ` Liang, Ma
@ 2020-10-28 15:33                               ` Ananyev, Konstantin
  2020-10-28 15:39                                 ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-28 15:33 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand



> > > 28/10/2020 14:49, Jerin Jacob:
> > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > >
> > > > > Hi Thomas,
> > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > >
> > > > I think, From the architecture point of view, the specific
> > > > functionally of UMONITOR may not be abstracted.
> > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > such a way that packet notification available through
> > > > checking interrupt status register or ring descriptor location, etc by
> > > > the driver. Use that callback as a notification mechanism rather
> > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > thoughts on abstraction.
> >
> > I think there is probably some sort of misunderstanding.
> > This API is not about providing acync notification when next packet arrives.
> > This is about to putting core to sleep till some event (or timeout) happens.
> > From my perspective the closest analogy: cond_timedwait().
> > So we need PMD to tell us what will be the address of the condition variable
> > we should sleep on.
> >
> > > I agree with Jerin.
> > > The ethdev API is the blocking problem.
> > > First problem: it is not well explained in doxygen.
> > > Second problem: it is probably not generic enough (if we understand it well)
> >
> > It is an address to sleep(/wakeup) on, plus expected value.
> > Honestly, I can't think-up of anything even more generic then that.
> > If you guys have something particular in mind - please share.
> 
> Current PMD callback:
> typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> *data_sz);
> 
> Can we make it as
> typedef void (*core_sleep_t)(void *rxq)
> 
> if we do such abstraction and "move the polling on memory by HW/CPU"
> to the driver using a helper function then
> I can think of abstracting in some way in all PMDs.

OK, I see, thanks for the explanation.
From my perspective, the main disadvantage of such an approach is that
it can't be extended easily.
If/when we have the ability for a core to sleep/wake up on multiple events
(multiple addresses), we will have to rework that API again.

> 
> Note: core_sleep_t can take some more arguments such as enumerated
> policy if something more needs to be pushed to the driver.
> 
> Thoughts?
> 
> >
> > >
> > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > >
> > > Being experimental is not an excuse to throw something
> > > which is not satisfying.
> > >
> > >
> >

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:30                               ` Liang, Ma
@ 2020-10-28 15:36                                 ` Jerin Jacob
  2020-10-28 15:44                                   ` Liang, Ma
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-28 15:36 UTC (permalink / raw)
  To: Liang, Ma
  Cc: Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On Wed, Oct 28, 2020 at 9:00 PM Liang, Ma <liang.j.ma@intel.com> wrote:
>
> On 28 Oct 20:44, Jerin Jacob wrote:
> > On Wed, Oct 28, 2020 at 8:27 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > >
> > >
> > >
> > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > >
> > > > > > Hi Thomas,
> > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > >
> > > > > I think, From the architecture point of view, the specific
> > > > > functionally of UMONITOR may not be abstracted.
> > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > such a way that packet notification available through
> > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > the driver. Use that callback as a notification mechanism rather
> > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > thoughts on abstraction.
> > >
> > > I think there is probably some sort of misunderstanding.
> > > This API is not about providing acync notification when next packet arrives.
> > > This is about to putting core to sleep till some event (or timeout) happens.
> > > From my perspective the closest analogy: cond_timedwait().
> > > So we need PMD to tell us what will be the address of the condition variable
> > > we should sleep on.
> > >
> > > > I agree with Jerin.
> > > > The ethdev API is the blocking problem.
> > > > First problem: it is not well explained in doxygen.
> > > > Second problem: it is probably not generic enough (if we understand it well)
> > >
> > > It is an address to sleep(/wakeup) on, plus expected value.
> > > Honestly, I can't think-up of anything even more generic then that.
> > > If you guys have something particular in mind - please share.
> >
> > Current PMD callback:
> > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > *data_sz);
> >
> > Can we make it as
> > typedef void (*core_sleep_t)(void *rxq)
> How about void (*eth_core_sleep_helper_t)(void *rxq, enum scheme, void *paramter)
> by this way, PMD can cast the parameter accorind to the scheme.
> e.g.  if scheme  MEM_MONITOR then cast to umwait way.
> however, this will introduce another problem.
> we need add PMD query callback to figure out if PMD support this scheme.

I thought scheme/policy would be something that the "application cares" about, like below,
not arch specifics:
1) wake me up on the first packet,
2) wake me up on the first packet or after a timeout of 100 us, etc.

Yes. We can have a query on the types of policies supported.
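
Something along these lines is what I have in mind, as a sketch only. All
names here (enum wake_policy, struct wake_policy_conf, the toy_ function)
are hypothetical and not part of any existing API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-level wake-up policies, one bit each. */
enum wake_policy {
	WAKE_ON_FIRST_PKT = 1u << 0,
	WAKE_ON_FIRST_PKT_OR_TIMEOUT = 1u << 1,
};

/* Hypothetical request; timeout_us is used only by the timeout policy. */
struct wake_policy_conf {
	enum wake_policy policy;
	uint32_t timeout_us;
};

/* Toy driver capability query: a bitmask of supported policies. */
static uint32_t
toy_supported_policies(void)
{
	return (uint32_t)(WAKE_ON_FIRST_PKT | WAKE_ON_FIRST_PKT_OR_TIMEOUT);
}

/* Validate a request against what the driver reports as supported. */
static bool
wake_policy_valid(const struct wake_policy_conf *conf, uint32_t supported)
{
	assert(conf != NULL);

	if ((supported & (uint32_t)conf->policy) == 0)
		return false;
	if (conf->policy == WAKE_ON_FIRST_PKT_OR_TIMEOUT &&
	    conf->timeout_us == 0)
		return false;
	return true;
}
```

The application would fill in the policy (for example a 100 us timeout) and
the driver would decide how to implement it on its own hardware.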


> >
> > if we do such abstraction and "move the polling on memory by HW/CPU"
> > to the driver using a helper function then
> > I can think of abstracting in some way in all PMDs.
> >
> > Note: core_sleep_t can take some more arguments such as enumerated
> > policy if something more needs to be pushed to the driver.
> >
> > Thoughts?
> >
> > >
> > > >
> > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > >
> > > > Being experimental is not an excuse to throw something
> > > > which is not satisfying.
> > > >
> > > >
> > >

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:33                               ` Ananyev, Konstantin
@ 2020-10-28 15:39                                 ` Jerin Jacob
  2020-10-28 15:49                                   ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-28 15:39 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On Wed, Oct 28, 2020 at 9:04 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
>
> > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > >
> > > > > > Hi Thomas,
> > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > >
> > > > > I think, From the architecture point of view, the specific
> > > > > functionally of UMONITOR may not be abstracted.
> > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > such a way that packet notification available through
> > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > the driver. Use that callback as a notification mechanism rather
> > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > thoughts on abstraction.
> > >
> > > I think there is probably some sort of misunderstanding.
> > > This API is not about providing acync notification when next packet arrives.
> > > This is about to putting core to sleep till some event (or timeout) happens.
> > > From my perspective the closest analogy: cond_timedwait().
> > > So we need PMD to tell us what will be the address of the condition variable
> > > we should sleep on.
> > >
> > > > I agree with Jerin.
> > > > The ethdev API is the blocking problem.
> > > > First problem: it is not well explained in doxygen.
> > > > Second problem: it is probably not generic enough (if we understand it well)
> > >
> > > It is an address to sleep(/wakeup) on, plus expected value.
> > > Honestly, I can't think-up of anything even more generic then that.
> > > If you guys have something particular in mind - please share.
> >
> > Current PMD callback:
> > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > *data_sz);
> >
> > Can we make it as
> > typedef void (*core_sleep_t)(void *rxq)
> >
> > if we do such abstraction and "move the polling on memory by HW/CPU"
> > to the driver using a helper function then
> > I can think of abstracting in some way in all PMDs.
>
> Ok I see, thanks for explanation.
> From my perspective main disadvantage of such approach -
> it can't be extended easily.
> If/when will have an ability for core to sleep/wake-up on multiple events
> (multiple addresses) will have to either rework that API again.

I think we can enumerate the policies and pass the associated
structures as input to the driver.
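
A rough sketch of how such an enumerated policy plus associated
structure could look. All names below are hypothetical and not part of
the patchset, and the callback returns int here (rather than void as
proposed above) only so that the stub has something observable to
return:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy identifiers; none of these names exist in DPDK. */
enum sleep_policy {
	SLEEP_POLICY_FIRST_PKT,            /* wake up on first packet */
	SLEEP_POLICY_FIRST_PKT_OR_TIMEOUT, /* wake up on first packet or timeout */
};

/* Hypothetical per-policy parameters handed to the driver. */
struct sleep_params {
	enum sleep_policy policy;
	union {
		struct {
			uint64_t timeout_us;
		} timed;
	} u;
};

/* The proposed callback shape, extended with a policy argument. */
typedef int (*core_sleep_t)(void *rxq, const struct sleep_params *params);

/*
 * Stub driver callback: it does not sleep, it only reports the timeout
 * (in microseconds) it would use, with 0 meaning "no timeout".
 */
static int
stub_core_sleep(void *rxq, const struct sleep_params *params)
{
	(void)rxq;
	assert(params != NULL);

	switch (params->policy) {
	case SLEEP_POLICY_FIRST_PKT:
		return 0;
	case SLEEP_POLICY_FIRST_PKT_OR_TIMEOUT:
		return (int)params->u.timed.timeout_us;
	}
	return -1; /* unknown policy */
}
```

A query callback reporting which policies a driver supports would
presumably be needed next to this.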


>
> >
> > Note: core_sleep_t can take some more arguments such as enumerated
> > policy if something more needs to be pushed to the driver.
> >
> > Thoughts?
> >
> > >
> > > >
> > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > >
> > > > Being experimental is not an excuse to throw something
> > > > which is not satisfying.
> > > >
> > > >
> > >

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:36                                 ` Jerin Jacob
@ 2020-10-28 15:44                                   ` Liang, Ma
  2020-10-28 16:01                                     ` Jerin Jacob
  0 siblings, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-10-28 15:44 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On 28 Oct 21:06, Jerin Jacob wrote:
> On Wed, Oct 28, 2020 at 9:00 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> >
> > On 28 Oct 20:44, Jerin Jacob wrote:
> > > On Wed, Oct 28, 2020 at 8:27 PM Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com> wrote:
> > > >
> > > >
> > > >
> > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > >
> > > > > > > Hi Thomas,
> > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the
> > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > >
> > > > > > I think, From the architecture point of view, the specific
> > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > such a way that packet notification available through
> > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > thoughts on abstraction.
> > > >
> > > > I think there is probably some sort of misunderstanding.
> > > > This API is not about providing acync notification when next packet arrives.
> > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > From my perspective the closest analogy: cond_timedwait().
> > > > So we need PMD to tell us what will be the address of the condition variable
> > > > we should sleep on.
> > > >
> > > > > I agree with Jerin.
> > > > > The ethdev API is the blocking problem.
> > > > > First problem: it is not well explained in doxygen.
> > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > >
> > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > Honestly, I can't think-up of anything even more generic then that.
> > > > If you guys have something particular in mind - please share.
> > >
> > > Current PMD callback:
> > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > *data_sz);
> > >
> > > Can we make it as
> > > typedef void (*core_sleep_t)(void *rxq)
> > How about void (*eth_core_sleep_helper_t)(void *rxq, enum scheme, void *paramter)
> > by this way, PMD can cast the parameter accorind to the scheme.
> > e.g.  if scheme  MEM_MONITOR then cast to umwait way.
> > however, this will introduce another problem.
> > we need add PMD query callback to figure out if PMD support this scheme.
> 
> I thought scheme/policy something that "application cares" like below
> not arch specifics
> 1) wakeup up on first packet,
> 2) wake me up on first packet or timeout 100 us etc
I need to clarify the current design a bit. The proposed API only gets the target address.
The API itself (including the PMD callback) does not ask the processor to idle/sleep.
We use the post-RX callback to do that. For the ethdev layer, this API is used only to
fetch the target address from the PMD, which has minimal impact on existing code.
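
To illustrate the split, here is a minimal sketch of the decision a
post-RX callback makes, following the empty-poll counting described in
the cover letter. The struct, the function name and the threshold value
are made up for illustration; the real callback would then fetch the
wake address from the PMD and arm the monitor:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY_POLL_THRESHOLD 4 /* illustrative value, not from the patchset */

/* Hypothetical per-queue state kept by the power management code. */
struct queue_state {
	uint32_t empty_polls;
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns true once enough consecutive empty polls have been seen,
 * i.e. when the caller may ask the PMD for a wake address and sleep.
 */
static bool
post_rx_should_sleep(struct queue_state *qs, uint16_t nb_rx)
{
	assert(qs != NULL);

	if (nb_rx > 0) {
		/* traffic seen: reset the counter and keep polling */
		qs->empty_polls = 0;
		return false;
	}
	qs->empty_polls++;
	return qs->empty_polls >= EMPTY_POLL_THRESHOLD;
}
```

The PMD is only involved once this returns true; the counting itself
lives outside the driver.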

> Yes. We can have query on type of the policies supported.
> 
> 
> > >
> > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > to the driver using a helper function then
> > > I can think of abstracting in some way in all PMDs.
> > >
> > > Note: core_sleep_t can take some more arguments such as enumerated
> > > policy if something more needs to be pushed to the driver.
> > >
> > > Thoughts?
> > >
> > > >
> > > > >
> > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > >
> > > > > Being experimental is not an excuse to throw something
> > > > > which is not satisfying.
> > > > >
> > > > >
> > > >

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:39                                 ` Jerin Jacob
@ 2020-10-28 15:49                                   ` Ananyev, Konstantin
  2020-10-28 15:57                                     ` Jerin Jacob
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-28 15:49 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, October 28, 2020 3:40 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Thomas Monjalon <thomas@monjalon.net>; Ma, Liang J <liang.j.ma@intel.com>; dpdk-dev <dev@dpdk.org>; Ruifeng Wang (Arm
> Technology China) <ruifeng.wang@arm.com>; Wang, Haiyue <haiyue.wang@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; Hunt, David <david.hunt@intel.com>; Neil Horman <nhorman@tuxdriver.com>; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>; Marcin Wojtas <mw@semihalf.com>; Guy Tzalik
> <gtzalik@amazon.com>; Ajit Khaparde <ajit.khaparde@broadcom.com>; Harman Kalra <hkalra@marvell.com>; John Daley
> <johndale@cisco.com>; Wei Hu (Xavier <xavier.huwei@huawei.com>; Ziyang Xuan <xuanziyang2@huawei.com>; matan@nvidia.com; Yong
> Wang <yongwang@vmware.com>; david.marchand@redhat.com
> Subject: Re: [PATCH v10 0/9] Add PMD power mgmt
> 
> On Wed, Oct 28, 2020 at 9:04 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> >
> > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > >
> > > > > > > Hi Thomas,
> > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the
> > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > >
> > > > > > I think, From the architecture point of view, the specific
> > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > such a way that packet notification available through
> > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > thoughts on abstraction.
> > > >
> > > > I think there is probably some sort of misunderstanding.
> > > > This API is not about providing acync notification when next packet arrives.
> > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > From my perspective the closest analogy: cond_timedwait().
> > > > So we need PMD to tell us what will be the address of the condition variable
> > > > we should sleep on.
> > > >
> > > > > I agree with Jerin.
> > > > > The ethdev API is the blocking problem.
> > > > > First problem: it is not well explained in doxygen.
> > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > >
> > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > Honestly, I can't think-up of anything even more generic then that.
> > > > If you guys have something particular in mind - please share.
> > >
> > > Current PMD callback:
> > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > *data_sz);
> > >
> > > Can we make it as
> > > typedef void (*core_sleep_t)(void *rxq)
> > >
> > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > to the driver using a helper function then
> > > I can think of abstracting in some way in all PMDs.
> >
> > Ok I see, thanks for explanation.
> > From my perspective main disadvantage of such approach -
> > it can't be extended easily.
> > If/when will have an ability for core to sleep/wake-up on multiple events
> > (multiple addresses) will have to either rework that API again.
> 
> I think, we can enumerate the policies and pass the associated
> structures as input to the driver.

What I am trying to say: with that API we will not be able to wait
for events from multiple devices (HW queues).
I.e. something like this:

get_wake_addr(port=X, ..., &addr[0], ...);
get_wake_addr(port=Y, ..., &addr[1], ...);
wait_on_multi(addr, 2);

wouldn't be possible.
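
For the sake of discussion, a software-only sketch of what such a
multi-address wait could look like. This is a plain polling fallback,
not UMONITOR/UMWAIT, and the struct, the function signature and the
"masked value equals expected" wake condition are assumptions made for
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wake condition, one per queue, as reported by a PMD. */
struct wake_cond {
	volatile uint64_t *addr; /* address to watch */
	uint64_t expected;       /* value that signals an event */
	uint64_t mask;           /* bits of *addr to compare */
};

/*
 * Poll all conditions until one of them is met or max_spins passes
 * over the array have been made (a crude stand-in for a timeout).
 * Returns the index of the first condition met, or -1 on timeout.
 */
static int
wait_on_multi(const struct wake_cond *conds, size_t n, uint64_t max_spins)
{
	assert(conds != NULL);

	for (uint64_t spin = 0; spin < max_spins; spin++) {
		for (size_t i = 0; i < n; i++) {
			if ((*conds[i].addr & conds[i].mask) == conds[i].expected)
				return (int)i;
		}
	}
	return -1;
}
```

With a callback that hides the address inside the driver, the caller
never gets the addresses needed to build the conds[] array.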

> 
> 
> >
> > >
> > > Note: core_sleep_t can take some more arguments such as enumerated
> > > policy if something more needs to be pushed to the driver.
> > >
> > > Thoughts?
> > >
> > > >
> > > > >
> > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > >
> > > > > Being experimental is not an excuse to throw something
> > > > > which is not satisfying.
> > > > >
> > > > >
> > > >

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:49                                   ` Ananyev, Konstantin
@ 2020-10-28 15:57                                     ` Jerin Jacob
  2020-10-28 16:38                                       ` Ananyev, Konstantin
  2020-10-28 16:47                                       ` Liang, Ma
  0 siblings, 2 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-28 15:57 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On Wed, Oct 28, 2020 at 9:19 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Wednesday, October 28, 2020 3:40 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: Thomas Monjalon <thomas@monjalon.net>; Ma, Liang J <liang.j.ma@intel.com>; dpdk-dev <dev@dpdk.org>; Ruifeng Wang (Arm
> > Technology China) <ruifeng.wang@arm.com>; Wang, Haiyue <haiyue.wang@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>; Hunt, David <david.hunt@intel.com>; Neil Horman <nhorman@tuxdriver.com>; McDaniel, Timothy
> > <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>; Marcin Wojtas <mw@semihalf.com>; Guy Tzalik
> > <gtzalik@amazon.com>; Ajit Khaparde <ajit.khaparde@broadcom.com>; Harman Kalra <hkalra@marvell.com>; John Daley
> > <johndale@cisco.com>; Wei Hu (Xavier <xavier.huwei@huawei.com>; Ziyang Xuan <xuanziyang2@huawei.com>; matan@nvidia.com; Yong
> > Wang <yongwang@vmware.com>; david.marchand@redhat.com
> > Subject: Re: [PATCH v10 0/9] Add PMD power mgmt
> >
> > On Wed, Oct 28, 2020 at 9:04 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > >
> > >
> > >
> > > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > > >
> > > > > > > > Hi Thomas,
> > > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the
> > > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > > >
> > > > > > > I think, From the architecture point of view, the specific
> > > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > > such a way that packet notification available through
> > > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > > thoughts on abstraction.
> > > > >
> > > > > I think there is probably some sort of misunderstanding.
> > > > > This API is not about providing acync notification when next packet arrives.
> > > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > > From my perspective the closest analogy: cond_timedwait().
> > > > > So we need PMD to tell us what will be the address of the condition variable
> > > > > we should sleep on.
> > > > >
> > > > > > I agree with Jerin.
> > > > > > The ethdev API is the blocking problem.
> > > > > > First problem: it is not well explained in doxygen.
> > > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > > >
> > > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > > Honestly, I can't think-up of anything even more generic then that.
> > > > > If you guys have something particular in mind - please share.
> > > >
> > > > Current PMD callback:
> > > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > > *data_sz);
> > > >
> > > > Can we make it as
> > > > typedef void (*core_sleep_t)(void *rxq)
> > > >
> > > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > > to the driver using a helper function then
> > > > I can think of abstracting in some way in all PMDs.
> > >
> > > Ok I see, thanks for explanation.
> > > From my perspective main disadvantage of such approach -
> > > it can't be extended easily.
> > > If/when will have an ability for core to sleep/wake-up on multiple events
> > > (multiple addresses) will have to either rework that API again.
> >
> > I think, we can enumerate the policies and pass the associated
> > structures as input to the driver.
>
> What I am trying to say: with that API we will not be able to wait
> for events from multiple devices (HW queues).
> I.E. something like that:
>
> get_wake_addr(port=X, ..., &addr[0], ...);
> get_wake_addr(port=Y,..., &addr[1],...);
> wait_on_multi(addr, 2);
>
> wouldn't be possible.

I see. But the current implementation dictates that only one queue is
bound to a core. Right?


>
> >
> >
> > >
> > > >
> > > > Note: core_sleep_t can take some more arguments such as enumerated
> > > > policy if something more needs to be pushed to the driver.
> > > >
> > > > Thoughts?
> > > >
> > > > >
> > > > > >
> > > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > > >
> > > > > > Being experimental is not an excuse to throw something
> > > > > > which is not satisfying.
> > > > > >
> > > > > >
> > > > >

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:44                                   ` Liang, Ma
@ 2020-10-28 16:01                                     ` Jerin Jacob
  2020-10-28 16:21                                       ` Liang, Ma
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-28 16:01 UTC (permalink / raw)
  To: Liang, Ma
  Cc: Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On Wed, Oct 28, 2020 at 9:14 PM Liang, Ma <liang.j.ma@intel.com> wrote:
>
> On 28 Oct 21:06, Jerin Jacob wrote:
> > On Wed, Oct 28, 2020 at 9:00 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > >
> > > On 28 Oct 20:44, Jerin Jacob wrote:
> > > > On Wed, Oct 28, 2020 at 8:27 PM Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > > >
> > > > > > > > Hi Thomas,
> > > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the
> > > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > > >
> > > > > > > I think, From the architecture point of view, the specific
> > > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > > such a way that packet notification available through
> > > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > > thoughts on abstraction.
> > > > >
> > > > > I think there is probably some sort of misunderstanding.
> > > > > This API is not about providing acync notification when next packet arrives.
> > > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > > From my perspective the closest analogy: cond_timedwait().
> > > > > So we need PMD to tell us what will be the address of the condition variable
> > > > > we should sleep on.
> > > > >
> > > > > > I agree with Jerin.
> > > > > > The ethdev API is the blocking problem.
> > > > > > First problem: it is not well explained in doxygen.
> > > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > > >
> > > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > > Honestly, I can't think-up of anything even more generic then that.
> > > > > If you guys have something particular in mind - please share.
> > > >
> > > > Current PMD callback:
> > > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > > *data_sz);
> > > >
> > > > Can we make it as
> > > > typedef void (*core_sleep_t)(void *rxq)
> > > How about void (*eth_core_sleep_helper_t)(void *rxq, enum scheme, void *paramter)
> > > by this way, PMD can cast the parameter accorind to the scheme.
> > > e.g.  if scheme  MEM_MONITOR then cast to umwait way.
> > > however, this will introduce another problem.
> > > we need add PMD query callback to figure out if PMD support this scheme.
> >
> > I thought scheme/policy something that "application cares" like below
> > not arch specifics
> > 1) wakeup up on first packet,
> > 2) wake me up on first packet or timeout 100 us etc
> I need clarify about current design a bit. the purposed API just get target address.
> the API itself(include the PMD callback) will not demand the processor to idle/sleep.
> we use the post rx callback to do that. for ethdev layer, this API only is used  to
> fetch target address from PMD, which make minmal impact to existing code.

I understand that. But if we move that logic to common code in ethdev
as a set of internal PMD helper functions (helper functions for a class
of devices and/or arch), then it should be possible. Right?

I do understand that it will call for some rework in the code. I will
leave the specifics up to the ethdev maintainers.


>
> > Yes. We can have query on type of the policies supported.
> >
> >
> > > >
> > > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > > to the driver using a helper function then
> > > > I can think of abstracting in some way in all PMDs.
> > > >
> > > > Note: core_sleep_t can take some more arguments such as enumerated
> > > > policy if something more needs to be pushed to the driver.
> > > >
> > > > Thoughts?
> > > >
> > > > >
> > > > > >
> > > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > > >
> > > > > > Being experimental is not an excuse to throw something
> > > > > > which is not satisfying.
> > > > > >
> > > > > >
> > > > >

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 16:01                                     ` Jerin Jacob
@ 2020-10-28 16:21                                       ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-10-28 16:21 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On 28 Oct 21:31, Jerin Jacob wrote:
> On Wed, Oct 28, 2020 at 9:14 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> >
> > On 28 Oct 21:06, Jerin Jacob wrote:
> > > On Wed, Oct 28, 2020 at 9:00 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > >
> > > > On 28 Oct 20:44, Jerin Jacob wrote:
> > > > > On Wed, Oct 28, 2020 at 8:27 PM Ananyev, Konstantin
> > > > > <konstantin.ananyev@intel.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Thomas,
> > > > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the
> > > > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > > > >
> > > > > > > > I think, From the architecture point of view, the specific
> > > > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > > > such a way that packet notification available through
> > > > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > > > thoughts on abstraction.
> > > > > >
> > > > > > I think there is probably some sort of misunderstanding.
> > > > > > This API is not about providing acync notification when next packet arrives.
> > > > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > > > From my perspective the closest analogy: cond_timedwait().
> > > > > > So we need PMD to tell us what will be the address of the condition variable
> > > > > > we should sleep on.
> > > > > >
> > > > > > > I agree with Jerin.
> > > > > > > The ethdev API is the blocking problem.
> > > > > > > First problem: it is not well explained in doxygen.
> > > > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > > > >
> > > > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > > > Honestly, I can't think-up of anything even more generic then that.
> > > > > > If you guys have something particular in mind - please share.
> > > > >
> > > > > Current PMD callback:
> > > > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > > > *data_sz);
> > > > >
> > > > > Can we make it as
> > > > > typedef void (*core_sleep_t)(void *rxq)
> > > > How about void (*eth_core_sleep_helper_t)(void *rxq, enum scheme, void *paramter)
> > > > by this way, PMD can cast the parameter accorind to the scheme.
> > > > e.g.  if scheme  MEM_MONITOR then cast to umwait way.
> > > > however, this will introduce another problem.
> > > > we need add PMD query callback to figure out if PMD support this scheme.
> > >
> > > I thought scheme/policy something that "application cares" like below
> > > not arch specifics
> > > 1) wakeup up on first packet,
> > > 2) wake me up on first packet or timeout 100 us etc
> > I need clarify about current design a bit. the purposed API just get target address.
> > the API itself(include the PMD callback) will not demand the processor to idle/sleep.
> > we use the post rx callback to do that. for ethdev layer, this API only is used  to
> > fetch target address from PMD, which make minmal impact to existing code.
> 
> I understand that. But if we move that logic to common code in ethdev
> as a set of internal
> PMD helper functions(Helper functions for class of devices and/or
> arch) then it should be
> possible. Right?
Although that's possible, the power management logic would be duplicated in all related PMDs.
The advantage of the current design is that it decouples the power mgmt logic from the PMD
and needs only minimal help from the PMD. There are 3 schemes already, all implemented as post-RX callbacks.
Putting all the power mgmt logic into the PMD is way too heavy. The current design is also very easy to extend:
a new scheme has no impact on the ethdev layer unless it needs some kind of help from the PMD.
The other 2 schemes, PAUSE and Freq Scale, do not need any PMD help.
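
As a small illustration of that split, assuming hypothetical names for
the three schemes (these identifiers are not taken from the patchset),
only the monitor-based scheme has to ask the PMD for anything:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers for the three schemes mentioned above. */
enum pmd_mgmt_scheme {
	SCHEME_MONITOR,    /* UMONITOR/UMWAIT on an address supplied by the PMD */
	SCHEME_PAUSE,      /* PAUSE-based waiting, no PMD involvement */
	SCHEME_FREQ_SCALE, /* core frequency scaling, no PMD involvement */
};

/* Returns true if the scheme needs the PMD wake-address callback. */
static bool
scheme_needs_pmd_help(enum pmd_mgmt_scheme scheme)
{
	switch (scheme) {
	case SCHEME_MONITOR:
		return true;
	case SCHEME_PAUSE:
	case SCHEME_FREQ_SCALE:
		return false;
	}
	assert(!"unknown scheme");
	return false;
}
```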
> I do understand that, It will call for some rework in the code. I will
> leave up to ethdev maintainers
> on specifics.
> 
> 
> >
> > > Yes. We can have query on type of the policies supported.
> > >
> > >
> > > > >
> > > > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > > > to the driver using a helper function then
> > > > > I can think of abstracting in some way in all PMDs.
> > > > >
> > > > > Note: core_sleep_t can take some more arguments such as enumerated
> > > > > policy if something more needs to be pushed to the driver.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > > > >
> > > > > > > Being experimental is not an excuse to throw something
> > > > > > > which is not satisfying.
> > > > > > >
> > > > > > >
> > > > > >

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:57                                     ` Jerin Jacob
@ 2020-10-28 16:38                                       ` Ananyev, Konstantin
  2020-10-28 16:47                                       ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-28 16:38 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand



> 
> On Wed, Oct 28, 2020 at 9:19 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Wednesday, October 28, 2020 3:40 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: Thomas Monjalon <thomas@monjalon.net>; Ma, Liang J <liang.j.ma@intel.com>; dpdk-dev <dev@dpdk.org>; Ruifeng Wang (Arm
> > > Technology China) <ruifeng.wang@arm.com>; Wang, Haiyue <haiyue.wang@intel.com>; Richardson, Bruce
> > > <bruce.richardson@intel.com>; Hunt, David <david.hunt@intel.com>; Neil Horman <nhorman@tuxdriver.com>; McDaniel, Timothy
> > > <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>; Marcin Wojtas <mw@semihalf.com>; Guy Tzalik
> > > <gtzalik@amazon.com>; Ajit Khaparde <ajit.khaparde@broadcom.com>; Harman Kalra <hkalra@marvell.com>; John Daley
> > > <johndale@cisco.com>; Wei Hu (Xavier <xavier.huwei@huawei.com>; Ziyang Xuan <xuanziyang2@huawei.com>; matan@nvidia.com;
> Yong
> > > Wang <yongwang@vmware.com>; david.marchand@redhat.com
> > > Subject: Re: [PATCH v10 0/9] Add PMD power mgmt
> > >
> > > On Wed, Oct 28, 2020 at 9:04 PM Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com> wrote:
> > > >
> > > >
> > > >
> > > > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Thomas,
> > > > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From
> the
> > > > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > > > >
> > > > > > > > I think, From the architecture point of view, the specific
> > > > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > > > such a way that packet notification available through
> > > > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > > > thoughts on abstraction.
> > > > > >
> > > > > > I think there is probably some sort of misunderstanding.
> > > > > > This API is not about providing acync notification when next packet arrives.
> > > > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > > > From my perspective the closest analogy: cond_timedwait().
> > > > > > So we need PMD to tell us what will be the address of the condition variable
> > > > > > we should sleep on.
> > > > > >
> > > > > > > I agree with Jerin.
> > > > > > > The ethdev API is the blocking problem.
> > > > > > > First problem: it is not well explained in doxygen.
> > > > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > > > >
> > > > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > > > Honestly, I can't think-up of anything even more generic then that.
> > > > > > If you guys have something particular in mind - please share.
> > > > >
> > > > > Current PMD callback:
> > > > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > > > *data_sz);
> > > > >
> > > > > Can we make it as
> > > > > typedef void (*core_sleep_t)(void *rxq)
> > > > >
> > > > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > > > to the driver using a helper function then
> > > > > I can think of abstracting in some way in all PMDs.
> > > >
> > > > Ok I see, thanks for explanation.
> > > > From my perspective main disadvantage of such approach -
> > > > it can't be extended easily.
> > > > If/when will have an ability for core to sleep/wake-up on multiple events
> > > > (multiple addresses) will have to either rework that API again.
> > >
> > > I think, we can enumerate the policies and pass the associated
> > > structures as input to the driver.
> >
> > What I am trying to say: with that API we will not be able to wait
> > for events from multiple devices (HW queues).
> > I.E. something like that:
> >
> > get_wake_addr(port=X, ..., &addr[0], ...);
> > get_wake_addr(port=Y,..., &addr[1],...);
> > wait_on_multi(addr, 2);
> >
> > wouldn't be possible.
> 
> I see. But the current implementation dictates the only queue bound to
> a core. Right?

Yes, the current implementation of rte_power_monitor() supports only a single address.
However, the proposed APIs for both ethdev (get_wake_addr) and
power (rte_power_pmd_mgmt_queue_enable) don't dictate a
one-to-one mapping as the only possible usage model.
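
To make the multi-address case concrete, here is a minimal sketch of what a wait_on_multi() could look like as a plain polling fallback. Everything in it is an assumption for illustration: struct wake_cond, wake_cond_hit() and the max_spins bound are hypothetical names, all watched values are taken to be 64-bit (the data_sz argument of the real callback is left out), and no UMONITOR/UMWAIT is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: what a get_wake_addr()-style callback reports per queue. */
struct wake_cond {
	const volatile uint64_t *addr; /* location to watch */
	uint64_t expected;             /* masked value that means "wake up" */
	uint64_t mask;                 /* bits of *addr that are compared */
};

/* Return nonzero when the watched location holds the expected value. */
static inline int
wake_cond_hit(const struct wake_cond *c)
{
	return (*c->addr & c->mask) == c->expected;
}

/*
 * Hypothetical wait_on_multi(): poll n conditions for at most max_spins
 * rounds. Returns the index of the first condition that is hit, or -1 if
 * the spin budget runs out (the "timeout" case).
 */
static int
wait_on_multi(const struct wake_cond *conds, size_t n, uint64_t max_spins)
{
	assert(conds != NULL && n > 0);

	for (uint64_t spin = 0; spin < max_spins; spin++) {
		for (size_t i = 0; i < n; i++) {
			if (wake_cond_hit(&conds[i]))
				return (int)i;
		}
	}
	return -1;
}
```

With one wake_cond filled in per port (X and Y in the example above), the caller would pass the array and a count of 2. A hardware-assisted version would replace the polling loop, but the per-queue address/value/mask triple would stay the same.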
 
> 
> 
> >
> > >
> > >
> > > >
> > > > >
> > > > > Note: core_sleep_t can take some more arguments such as enumerated
> > > > > policy if something more needs to be pushed to the driver.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > > > >
> > > > > > > Being experimental is not an excuse to throw something
> > > > > > > which is not satisfying.
> > > > > > >
> > > > > > >
> > > > > >

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 15:57                                     ` Jerin Jacob
  2020-10-28 16:38                                       ` Ananyev, Konstantin
@ 2020-10-28 16:47                                       ` Liang, Ma
  2020-10-28 16:54                                         ` McDaniel, Timothy
  2020-10-28 17:02                                         ` Ajit Khaparde
  1 sibling, 2 replies; 421+ messages in thread
From: Liang, Ma @ 2020-10-28 16:47 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Ajit Khaparde, Harman Kalra, John Daley, Wei Hu (Xavier,
	Ziyang Xuan, matan, Yong Wang, david.marchand

On 28 Oct 21:27, Jerin Jacob wrote:
> On Wed, Oct 28, 2020 at 9:19 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> > > > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Thomas,
> > > > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the
> > > > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > > > >
> > > > > > > > I think, From the architecture point of view, the specific
> > > > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > > > such a way that packet notification available through
> > > > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > > > thoughts on abstraction.
> > > > > >
> > > > > > I think there is probably some sort of misunderstanding.
> > > > > > This API is not about providing acync notification when next packet arrives.
> > > > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > > > From my perspective the closest analogy: cond_timedwait().
> > > > > > So we need PMD to tell us what will be the address of the condition variable
> > > > > > we should sleep on.
> > > > > >
> > > > > > > I agree with Jerin.
> > > > > > > The ethdev API is the blocking problem.
> > > > > > > First problem: it is not well explained in doxygen.
> > > > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > > > >
> > > > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > > > Honestly, I can't think-up of anything even more generic then that.
> > > > > > If you guys have something particular in mind - please share.
> > > > >
> > > > > Current PMD callback:
> > > > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > > > *data_sz);
> > > > >
> > > > > Can we make it as
> > > > > typedef void (*core_sleep_t)(void *rxq)
> > > > >
> > > > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > > > to the driver using a helper function then
> > > > > I can think of abstracting in some way in all PMDs.
> > > >
> > > > Ok I see, thanks for explanation.
> > > > From my perspective main disadvantage of such approach -
> > > > it can't be extended easily.
> > > > If/when will have an ability for core to sleep/wake-up on multiple events
> > > > (multiple addresses) will have to either rework that API again.
> > >
> > > I think, we can enumerate the policies and pass the associated
> > > structures as input to the driver.
> >
> > What I am trying to say: with that API we will not be able to wait
> > for events from multiple devices (HW queues).
> > I.E. something like that:
> >
> > get_wake_addr(port=X, ..., &addr[0], ...);
> > get_wake_addr(port=Y,..., &addr[1],...);
> > wait_on_multi(addr, 2);
> >
> > wouldn't be possible.
> 
> I see. But the current implementation dictates the only queue bound to
> a core. Right?
The current implementation supports only a 1:1 queue/core mapping because of
the limitation of umwait/umonitor, which cannot work with multiple address
ranges. However, other schemes like PAUSE/Freq Scale have no such limitation.
The proposed API itself doesn't limit usage to a 1:1 queue/core mapping.
> 
> 
> >
> > >
> > >
> > > >
> > > > >
> > > > > Note: core_sleep_t can take some more arguments such as enumerated
> > > > > policy if something more needs to be pushed to the driver.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > > > >
> > > > > > > Being experimental is not an excuse to throw something
> > > > > > > which is not satisfying.
> > > > > > >
> > > > > > >
> > > > > >

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 16:47                                       ` Liang, Ma
@ 2020-10-28 16:54                                         ` McDaniel, Timothy
  2020-10-29  9:19                                           ` Liang, Ma
  2020-10-28 17:02                                         ` Ajit Khaparde
  1 sibling, 1 reply; 421+ messages in thread
From: McDaniel, Timothy @ 2020-10-28 16:54 UTC (permalink / raw)
  To: Ma, Liang J, Jerin Jacob
  Cc: Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman, Eads,
	Gage, Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier, Ziyang Xuan, matan, Yong Wang,
	david.marchand


> -----Original Message-----
> From: Ma, Liang J <liang.j.ma@intel.com>
> Sent: Wednesday, October 28, 2020 11:48 AM
> To: Jerin Jacob <jerinjacobk@gmail.com>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>; dpdk-dev <dev@dpdk.org>; Ruifeng Wang (Arm
> Technology China) <ruifeng.wang@arm.com>; Wang, Haiyue
> <haiyue.wang@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> Hunt, David <david.hunt@intel.com>; Neil Horman <nhorman@tuxdriver.com>;
> McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> <gage.eads@intel.com>; Marcin Wojtas <mw@semihalf.com>; Guy Tzalik
> <gtzalik@amazon.com>; Ajit Khaparde <ajit.khaparde@broadcom.com>;
> Harman Kalra <hkalra@marvell.com>; John Daley <johndale@cisco.com>; Wei
> Hu (Xavier <xavier.huwei@huawei.com>; Ziyang Xuan
> <xuanziyang2@huawei.com>; matan@nvidia.com; Yong Wang
> <yongwang@vmware.com>; david.marchand@redhat.com
> Subject: Re: [PATCH v10 0/9] Add PMD power mgmt
> 
> On 28 Oct 21:27, Jerin Jacob wrote:
> > On Wed, Oct 28, 2020 at 9:19 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > > > > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma
> <liang.j.ma@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Thomas,
> > > > > > > > > >   I think I addressed all of the questions in relation to V9. I don't
> think I can solve the issue of a generic API on my own. From the
> > > > > > > > Community Call last week Jerin also said that a generic was
> investigated but that a single solution wasn't feasible.
> > > > > > > > >
> > > > > > > > > I think, From the architecture point of view, the specific
> > > > > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > > > > such a way that packet notification available through
> > > > > > > > > checking interrupt status register or ring descriptor location, etc
> by
> > > > > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > > > > than defining a memory-based scheme that UMONITOR expects?
> or similar
> > > > > > > > > thoughts on abstraction.
> > > > > > >
> > > > > > > I think there is probably some sort of misunderstanding.
> > > > > > > This API is not about providing acync notification when next packet
> arrives.
> > > > > > > This is about to putting core to sleep till some event (or timeout)
> happens.
> > > > > > > From my perspective the closest analogy: cond_timedwait().
> > > > > > > So we need PMD to tell us what will be the address of the condition
> variable
> > > > > > > we should sleep on.
> > > > > > >
> > > > > > > > I agree with Jerin.
> > > > > > > > The ethdev API is the blocking problem.
> > > > > > > > First problem: it is not well explained in doxygen.
> > > > > > > > Second problem: it is probably not generic enough (if we understand
> it well)
> > > > > > >
> > > > > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > > > > Honestly, I can't think-up of anything even more generic then that.
> > > > > > > If you guys have something particular in mind - please share.
> > > > > >
> > > > > > Current PMD callback:
> > > > > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > > > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > > > > *data_sz);
> > > > > >
> > > > > > Can we make it as
> > > > > > typedef void (*core_sleep_t)(void *rxq)
> > > > > >
> > > > > > if we do such abstraction and "move the polling on memory by
> HW/CPU"
> > > > > > to the driver using a helper function then
> > > > > > I can think of abstracting in some way in all PMDs.
> > > > >
> > > > > Ok I see, thanks for explanation.
> > > > > From my perspective main disadvantage of such approach -
> > > > > it can't be extended easily.
> > > > > If/when will have an ability for core to sleep/wake-up on multiple events
> > > > > (multiple addresses) will have to either rework that API again.
> > > >
> > > > I think, we can enumerate the policies and pass the associated
> > > > structures as input to the driver.
> > >
> > > What I am trying to say: with that API we will not be able to wait
> > > for events from multiple devices (HW queues).
> > > I.E. something like that:
> > >
> > > get_wake_addr(port=X, ..., &addr[0], ...);
> > > get_wake_addr(port=Y,..., &addr[1],...);
> > > wait_on_multi(addr, 2);
> > >
> > > wouldn't be possible.
> >
> > I see. But the current implementation dictates the only queue bound to
> > a core. Right?
> The current implementation supports only a 1:1 queue/core mapping because of
> the limitation of umwait/umonitor, which cannot work with multiple address
> ranges. However, other schemes like PAUSE/Freq Scale have no such
> limitation.
> The proposed API itself doesn't limit usage to a 1:1 queue/core mapping.
> >
> >
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > Note: core_sleep_t can take some more arguments such as
> enumerated
> > > > > > policy if something more needs to be pushed to the driver.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > > This API is experimental and other vendor support can be added
> as needed. If there are any other open issue let me know?
> > > > > > > >
> > > > > > > > Being experimental is not an excuse to throw something
> > > > > > > > which is not satisfying.
> > > > > > > >
> > > > > > > >
> > > > > > >

It would be nice if the low-level definitions of the UMWAIT and UMONITOR instructions were split out
into their own inline functions or macros so that any PMD could use the intrinsics without being tied to ethdev or
any of the other logic associated with this patch set.  This would be similar to rte_wmb, and so on.
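
A rough sketch of what such standalone wrappers might look like, assuming GCC/Clang inline assembly on x86-64. The function names (waitpkg_supported, umonitor_raw, umwait_raw) are hypothetical and are not the names used in the patch set. The raw opcode bytes are used because, as the cover letter notes, compiler support for these instructions is not yet widespread. Callers are expected to check waitpkg_supported() first, since the instructions fault on CPUs without WAITPKG.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <cpuid.h>
#endif

/* Return 1 if the CPU reports WAITPKG (CPUID leaf 7, subleaf 0, ECX bit 5). */
static inline int
waitpkg_supported(void)
{
#if defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return (ecx >> 5) & 1;
#else
	return 0;
#endif
}

#if defined(__x86_64__)
/* Arm address monitoring on p. Raw encoding of "umonitor %rdi". */
static inline void
umonitor_raw(const volatile void *p)
{
	assert(p != NULL);
	__asm__ volatile(".byte 0xf3, 0x0f, 0xae, 0xf7"
			 :
			 : "D"(p)
			 : "memory");
}

/*
 * Wait for a write to the monitored range or until the TSC reaches
 * tsc_deadline. Raw encoding of "umwait %edi"; EDI = 0 requests C0.2,
 * EDX:EAX carries the deadline.
 */
static inline void
umwait_raw(uint64_t tsc_deadline)
{
	__asm__ volatile(".byte 0xf2, 0x0f, 0xae, 0xf7"
			 :
			 : "D"(0), "a"((uint32_t)tsc_deadline),
			   "d"((uint32_t)(tsc_deadline >> 32))
			 : "cc", "memory");
}
#endif
```

Keeping the wrappers free of any ethdev types is what would let a PMD use them directly, in the same way rte_wmb can be used from anywhere.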



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 16:47                                       ` Liang, Ma
  2020-10-28 16:54                                         ` McDaniel, Timothy
@ 2020-10-28 17:02                                         ` Ajit Khaparde
  2020-10-28 18:10                                           ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Ajit Khaparde @ 2020-10-28 17:02 UTC (permalink / raw)
  To: Liang, Ma
  Cc: Jerin Jacob, Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Harman Kalra, John Daley, Wei Hu (Xavier, Ziyang Xuan, matan,
	Yong Wang, david.marchand

On Wed, Oct 28, 2020 at 9:47 AM Liang, Ma <liang.j.ma@intel.com> wrote:
>
> On 28 Oct 21:27, Jerin Jacob wrote:
> > On Wed, Oct 28, 2020 at 9:19 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > > > > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Thomas,
> > > > > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own. From the
> > > > > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > > > > >
> > > > > > > > > I think, From the architecture point of view, the specific
> > > > > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > > > > such a way that packet notification available through
> > > > > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > > > > thoughts on abstraction.
> > > > > > >
> > > > > > > I think there is probably some sort of misunderstanding.
> > > > > > > This API is not about providing acync notification when next packet arrives.
> > > > > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > > > > From my perspective the closest analogy: cond_timedwait().
> > > > > > > So we need PMD to tell us what will be the address of the condition variable
> > > > > > > we should sleep on.
> > > > > > >
> > > > > > > > I agree with Jerin.
> > > > > > > > The ethdev API is the blocking problem.
> > > > > > > > First problem: it is not well explained in doxygen.
> > > > > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > > > > >
> > > > > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > > > > Honestly, I can't think-up of anything even more generic then that.
> > > > > > > If you guys have something particular in mind - please share.
> > > > > >
> > > > > > Current PMD callback:
> > > > > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > > > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > > > > *data_sz);
> > > > > >
> > > > > > Can we make it as
> > > > > > typedef void (*core_sleep_t)(void *rxq)
> > > > > >
> > > > > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > > > > to the driver using a helper function then
> > > > > > I can think of abstracting in some way in all PMDs.
> > > > >
> > > > > Ok I see, thanks for explanation.
> > > > > From my perspective main disadvantage of such approach -
> > > > > it can't be extended easily.
> > > > > If/when will have an ability for core to sleep/wake-up on multiple events
> > > > > (multiple addresses) will have to either rework that API again.
> > > >
> > > > I think, we can enumerate the policies and pass the associated
> > > > structures as input to the driver.
> > >
> > > What I am trying to say: with that API we will not be able to wait
> > > for events from multiple devices (HW queues).
> > > I.E. something like that:
> > >
> > > get_wake_addr(port=X, ..., &addr[0], ...);
> > > get_wake_addr(port=Y,..., &addr[1],...);
> > > wait_on_multi(addr, 2);
> > >
> > > wouldn't be possible.
> >
> > I see. But the current implementation dictates the only queue bound to
> > a core. Right?
> The current implementation supports only a 1:1 queue/core mapping because of
> the limitation of umwait/umonitor, which cannot work with multiple address
> ranges. However, other schemes like PAUSE/Freq Scale have no such limitation.
> The proposed API itself doesn't limit usage to a 1:1 queue/core mapping.

The PMD would not know if it is 1:1 queue/core or any other shared scheme.
So the intelligence and decision making is best left to the application.
I think the PMD and the underlying hardware do not need to know what kind of
power management scheme is implemented.
IMHO the original API which provides the address, value and mask should suffice.
Any other callback or handshake between PMD and application may be overkill.
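
As an illustration of keeping the policy on the application side, here is a minimal sketch of the empty-poll counting described in the cover letter. The struct, the function name and the threshold value are all hypothetical. The driver only has to supply the address, value and mask once this policy decides it is time to sleep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY_POLL_THRESHOLD 512 /* hypothetical tuning value */

/* Per-queue policy state, owned by the application (or the power library). */
struct queue_pm_state {
	uint32_t empty_polls; /* consecutive RX bursts that returned 0 packets */
};

/*
 * Update the state after an RX burst that returned nb_rx packets.
 * Returns 1 once the threshold of consecutive empty polls is reached,
 * meaning the caller may now ask the PMD for the wake address and sleep;
 * returns 0 otherwise. Any received packet resets the counter.
 */
static int
pm_should_sleep(struct queue_pm_state *st, uint16_t nb_rx)
{
	assert(st != NULL);

	if (nb_rx > 0) {
		st->empty_polls = 0;
		return 0;
	}
	if (st->empty_polls < EMPTY_POLL_THRESHOLD)
		st->empty_polls++;
	return st->empty_polls >= EMPTY_POLL_THRESHOLD;
}
```

Nothing here depends on how many queues a core serves, which is the point: the PMD never needs to see this state.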

> >
> >
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > Note: core_sleep_t can take some more arguments such as enumerated
> > > > > > policy if something more needs to be pushed to the driver.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > > > > >
> > > > > > > > Being experimental is not an excuse to throw something
> > > > > > > > which is not satisfying.
> > > > > > > >
> > > > > > > >
> > > > > > >

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 17:02                                         ` Ajit Khaparde
@ 2020-10-28 18:10                                           ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-28 18:10 UTC (permalink / raw)
  To: Ajit Khaparde, Ma, Liang J
  Cc: Jerin Jacob, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman,
	McDaniel, Timothy, Eads, Gage, Marcin Wojtas, Guy Tzalik,
	Harman Kalra, John Daley, Wei Hu (Xavier, Ziyang Xuan, matan,
	Yong Wang, david.marchand



> -----Original Message-----
> From: Ajit Khaparde <ajit.khaparde@broadcom.com>
> Sent: Wednesday, October 28, 2020 5:02 PM
> To: Ma, Liang J <liang.j.ma@intel.com>
> Cc: Jerin Jacob <jerinjacobk@gmail.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>; dpdk-dev <dev@dpdk.org>; Ruifeng Wang (Arm Technology China) <ruifeng.wang@arm.com>; Wang, Haiyue
> <haiyue.wang@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Hunt, David <david.hunt@intel.com>; Neil Horman
> <nhorman@tuxdriver.com>; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>; Marcin Wojtas
> <mw@semihalf.com>; Guy Tzalik <gtzalik@amazon.com>; Harman Kalra <hkalra@marvell.com>; John Daley <johndale@cisco.com>; Wei
> Hu (Xavier <xavier.huwei@huawei.com>; Ziyang Xuan <xuanziyang2@huawei.com>; matan@nvidia.com; Yong Wang
> <yongwang@vmware.com>; david.marchand@redhat.com
> Subject: Re: [PATCH v10 0/9] Add PMD power mgmt
> 
> On Wed, Oct 28, 2020 at 9:47 AM Liang, Ma <liang.j.ma@intel.com> wrote:
> >
> > On 28 Oct 21:27, Jerin Jacob wrote:
> > > On Wed, Oct 28, 2020 at 9:19 PM Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com> wrote:
> > > > > > > > > 28/10/2020 14:49, Jerin Jacob:
> > > > > > > > > > On Wed, Oct 28, 2020 at 7:05 PM Liang, Ma <liang.j.ma@intel.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Thomas,
> > > > > > > > > > >   I think I addressed all of the questions in relation to V9. I don't think I can solve the issue of a generic API on my own.
> From the
> > > > > > > > > Community Call last week Jerin also said that a generic was investigated but that a single solution wasn't feasible.
> > > > > > > > > >
> > > > > > > > > > I think, From the architecture point of view, the specific
> > > > > > > > > > functionally of UMONITOR may not be abstracted.
> > > > > > > > > > But from the ethdev callback point of view, Can it be abstracted in
> > > > > > > > > > such a way that packet notification available through
> > > > > > > > > > checking interrupt status register or ring descriptor location, etc by
> > > > > > > > > > the driver. Use that callback as a notification mechanism rather
> > > > > > > > > > than defining a memory-based scheme that UMONITOR expects? or similar
> > > > > > > > > > thoughts on abstraction.
> > > > > > > >
> > > > > > > > I think there is probably some sort of misunderstanding.
> > > > > > > > This API is not about providing acync notification when next packet arrives.
> > > > > > > > This is about to putting core to sleep till some event (or timeout) happens.
> > > > > > > > From my perspective the closest analogy: cond_timedwait().
> > > > > > > > So we need PMD to tell us what will be the address of the condition variable
> > > > > > > > we should sleep on.
> > > > > > > >
> > > > > > > > > I agree with Jerin.
> > > > > > > > > The ethdev API is the blocking problem.
> > > > > > > > > First problem: it is not well explained in doxygen.
> > > > > > > > > Second problem: it is probably not generic enough (if we understand it well)
> > > > > > > >
> > > > > > > > It is an address to sleep(/wakeup) on, plus expected value.
> > > > > > > > Honestly, I can't think-up of anything even more generic then that.
> > > > > > > > If you guys have something particular in mind - please share.
> > > > > > >
> > > > > > > Current PMD callback:
> > > > > > > typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void
> > > > > > > **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t
> > > > > > > *data_sz);
> > > > > > >
> > > > > > > Can we make it as
> > > > > > > typedef void (*core_sleep_t)(void *rxq)
> > > > > > >
> > > > > > > if we do such abstraction and "move the polling on memory by HW/CPU"
> > > > > > > to the driver using a helper function then
> > > > > > > I can think of abstracting in some way in all PMDs.
> > > > > >
> > > > > > Ok I see, thanks for explanation.
> > > > > > From my perspective main disadvantage of such approach -
> > > > > > it can't be extended easily.
> > > > > > If/when will have an ability for core to sleep/wake-up on multiple events
> > > > > > (multiple addresses) will have to either rework that API again.
> > > > >
> > > > > I think, we can enumerate the policies and pass the associated
> > > > > structures as input to the driver.
> > > >
> > > > What I am trying to say: with that API we will not be able to wait
> > > > for events from multiple devices (HW queues).
> > > > I.E. something like that:
> > > >
> > > > get_wake_addr(port=X, ..., &addr[0], ...);
> > > > get_wake_addr(port=Y,..., &addr[1],...);
> > > > wait_on_multi(addr, 2);
> > > >
> > > > wouldn't be possible.
> > >
> > > I see. But the current implementation dictates the only queue bound to
> > > a core. Right?
> > The current implementation supports only a 1:1 queue/core mapping because of
> > the limitation of umwait/umonitor, which cannot work with multiple address
> > ranges. However, other schemes like PAUSE/Freq Scale have no such limitation.
> > The proposed API itself doesn't limit usage to a 1:1 queue/core mapping.
> 
> The PMD would not know if it is 1:1 queue/core or any other shared scheme.
> So the intelligence and decision making is best left to the application.
> I think the PMD and the underlying hardware do not need to know what kind of
> power management scheme is implemented.

Yep, good point. 100% agree.

> IMHO the original API which provides the address, value and mask should suffice.
> Any other callback or handshake between PMD and application may be overkill.
> 
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Note: core_sleep_t can take some more arguments such as enumerated
> > > > > > > policy if something more needs to be pushed to the driver.
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > > This API is experimental and other vendor support can be added as needed. If there are any other open issue let me know?
> > > > > > > > >
> > > > > > > > > Being experimental is not an excuse to throw something
> > > > > > > > > which is not satisfying.
> > > > > > > > >
> > > > > > > > >
> > > > > > > >

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-28 16:54                                         ` McDaniel, Timothy
@ 2020-10-29  9:19                                           ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-10-29  9:19 UTC (permalink / raw)
  To: McDaniel, Timothy
  Cc: Jerin Jacob, Ananyev, Konstantin, Thomas Monjalon, dpdk-dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Richardson, Bruce, Hunt, David, Neil Horman, Eads,
	Gage, Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier, Ziyang Xuan, matan, Yong Wang,
	david.marchand

On 28 Oct 09:54, McDaniel, Timothy wrote:
<snip>
> > > > > > > >
> > > > > > > >
> 
> It would be nice if the low-level definitions of the UMWAIT and UMONITOR instructions were split out
> into their own inline functions or macros so that any PMD could use the intrinsics without being tied to ethdev or
> any of the other logic associated with this patch set.  This would be similar to rte_wmb, and so on.
The current patches already split the intrinsics out, and they are part of the EAL library,
so any other PMD can reuse them easily.
> 
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 2/9] eal: add power management intrinsics
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 2/9] eal: add power management intrinsics Liang Ma
@ 2020-10-29 17:39                   ` Thomas Monjalon
  0 siblings, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-29 17:39 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, ruifeng.wang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra,
	johndale, xavier.huwei, xuanziyang2, matan, yongwang,
	Anatoly Burakov, Jerin Jacob, Jan Viktorin, David Christensen,
	david.marchand

27/10/2020 15:59, Liang Ma:
> +static inline uint64_t
> +__get_umwait_val(const volatile void *p, const uint8_t sz)
> +{
> +       switch (sz) {
> +       case sizeof(uint8_t):
> +               return *(const volatile uint8_t *)p;
> +       case sizeof(uint16_t):
> +               return *(const volatile uint16_t *)p;
> +       case sizeof(uint32_t):
> +               return *(const volatile uint32_t *)p;
> +       case sizeof(uint64_t):
> +               return *(const volatile uint64_t *)p;
> +       default:
> +               /* this is an intrinsic, so we can't have any error handling */
> +               RTE_ASSERT(0);
> +               return 0;
> +       }
> +}

This function is going to pollute the namespace.
I will fix it to __rte_power_get_umwait_val().
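
For reference, a sketch of the helper under the new name, together with a hypothetical caller showing how the value it reads would be compared against the expected value and mask. wake_value_matches() is an illustrative name, and plain assert() stands in for RTE_ASSERT so the snippet builds outside the DPDK tree.

```c
#include <assert.h>
#include <stdint.h>

/* Read a value of the given size (1, 2, 4 or 8 bytes) from p. */
static inline uint64_t
__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		/* this is an intrinsic, so we can't have any error handling */
		assert(0); /* stand-in for RTE_ASSERT(0) */
		return 0;
	}
}

/*
 * Hypothetical caller: nonzero if the masked value at p already equals
 * the expected value, in which case there is no reason to go to sleep.
 */
static inline int
wake_value_matches(const volatile void *p, uint8_t sz,
		   uint64_t expected, uint64_t mask)
{
	return (__rte_power_get_umwait_val(p, sz) & mask) == expected;
}
```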




^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 0/9] " Liang Ma
  2020-10-27 16:02                   ` Thomas Monjalon
  2020-10-27 20:53                   ` Ajit Khaparde
@ 2020-10-29 17:42                   ` Thomas Monjalon
  2020-10-30  9:36                     ` Liang, Ma
  2 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-29 17:42 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, ruifeng.wang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra,
	johndale, xavier.huwei, xuanziyang2, matan, yongwang

27/10/2020 15:59, Liang Ma:
> Liang Ma (9):
>   eal: add new x86 cpuid support for WAITPKG
>   eal: add power management intrinsics
>   eal: add intrinsics support check infrastructure

EAL patches applied, thanks.

>   ethdev: add simple power management API

Waiting for doxygen reword which may help to get acks for ethdev API.

>   power: add PMD power management API and callback
>   net/ixgbe: implement power management API
>   net/i40e: implement power management API
>   net/ice: implement power management API
>   examples/l3fwd-power: enable PMD power mgmt





* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure Liang Ma
@ 2020-10-29 21:27                   ` David Marchand
  2020-10-30 10:09                     ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: David Marchand @ 2020-10-29 21:27 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Thomas Monjalon, Timothy McDaniel,
	Gage Eads, Marcin Wojtas, Guy Tzalik, Ajit Khaparde,
	Harman Kalra, John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Anatoly Burakov,
	Jerin Jacob, Jan Viktorin, David Christensen, Ray Kinsella

On Tue, Oct 27, 2020 at 4:00 PM Liang Ma <liang.j.ma@intel.com> wrote:
>
> Currently, it is not possible to check support for intrinsics that
> are platform-specific, cannot be abstracted in a generic way, or do not
> have support on all architectures. The CPUID flags can be used to some
> extent, but they are only defined for their platform, while intrinsics
> will be available to all code as they are in generic headers.
>
> This patch introduces infrastructure to check support for certain
> platform-specific intrinsics, and adds support for checking support for
> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Acked-by: Ray Kinsella <mdr@ashroe.eu>

Coming late to the party, it seems crowded...



> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
> index 872f0ebe3e..28a5aecde8 100644
> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
> @@ -13,6 +13,32 @@
>  #include "rte_common.h"
>  #include <errno.h>
>
> +#include <rte_compat.h>
> +
> +/**
> + * Structure used to describe platform-specific intrinsics that may or may not
> + * be supported at runtime.
> + */
> +struct rte_cpu_intrinsics {
> +       uint32_t power_monitor : 1;
> +       /**< indicates support for rte_power_monitor function */
> +       uint32_t power_pause : 1;
> +       /**< indicates support for rte_power_pause function */
> +};

- The rte_power library is supposed to be built on top of cpuflags.
Not the other way around.
Those capabilities should have been kept inside the rte_power_ API and
not pollute the cpuflags API.

- All of this should have come as a single patch as the previously
introduced API is unusable before.


> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Check CPU support for various intrinsics at runtime.
> + *
> + * @param intrinsics
> + *     Pointer to a structure to be filled.
> + */
> +__rte_experimental
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
> +
>  /**
>   * Enumeration of all CPU features supported
>   */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> index fb897d9060..03a326f076 100644
> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -32,6 +32,10 @@
>   * checked against the expected value, and if they match, the entering of
>   * optimized power state may be aborted.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *

- Reading this API description... what am I supposed to do in my
application or driver who wants to use the new
rte_power_monitor/rte_power_pause stuff?

I should call rte_cpu_get_features(TOTO) ?
This comment does not give a hint.

I suppose the intent was to refer to the rte_cpu_get_intrinsics_support() thing.
This must be fixed.


- Again, I wonder why we are exposing all this stuff.
This should be hidden in the rte_power API.


-- 
David Marchand



* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-29 17:42                   ` Thomas Monjalon
@ 2020-10-30  9:36                     ` Liang, Ma
  2020-10-30  9:58                       ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-10-30  9:36 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, ruifeng.wang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra,
	johndale, xavier.huwei, xuanziyang2, matan, yongwang

On 29 Oct 18:42, Thomas Monjalon wrote:
> 27/10/2020 15:59, Liang Ma:
> > Liang Ma (9):
> >   eal: add new x86 cpuid support for WAITPKG
> >   eal: add power management intrinsics
> >   eal: add intrinsics support check infrastructure
> 
> EAL patches applied, thanks.
thanks!
> 
> >   ethdev: add simple power management API
> 
> Waiting for doxygen reword which may help to get acks for ethdev API.
we are working on this. I will send out v11 today.
> 
> >   power: add PMD power management API and callback
> >   net/ixgbe: implement power management API
> >   net/i40e: implement power management API
> >   net/ice: implement power management API
> >   examples/l3fwd-power: enable PMD power mgmt
> 
> 
> 


* Re: [dpdk-dev] [PATCH v10 0/9] Add PMD power mgmt
  2020-10-30  9:36                     ` Liang, Ma
@ 2020-10-30  9:58                       ` Thomas Monjalon
  0 siblings, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-30  9:58 UTC (permalink / raw)
  To: Liang, Ma
  Cc: dev, ruifeng.wang, haiyue.wang, bruce.richardson,
	konstantin.ananyev, david.hunt, jerinjacobk, nhorman,
	timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra,
	johndale, xavier.huwei, xuanziyang2, matan, yongwang,
	david.marchand

30/10/2020 10:36, Liang, Ma:
> On 29 Oct 18:42, Thomas Monjalon wrote:
> > 27/10/2020 15:59, Liang Ma:
> > > Liang Ma (9):
> > >   eal: add new x86 cpuid support for WAITPKG
> > >   eal: add power management intrinsics
> > >   eal: add intrinsics support check infrastructure
> > 
> > EAL patches applied, thanks.
> thanks!
> > 
> > >   ethdev: add simple power management API
> > 
> > Waiting for doxygen reword which may help to get acks for ethdev API.
> we are working on this. I will send out v11 today.

David sent some interesting questions about the EAL API.
Maybe some explanations are missing, and we could improve the API.
Please reply, I would like to solve the EAL questions quickly.




* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-29 21:27                   ` David Marchand
@ 2020-10-30 10:09                     ` Burakov, Anatoly
  2020-10-30 10:14                       ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-30 10:09 UTC (permalink / raw)
  To: David Marchand, Liang Ma
  Cc: dev, Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Thomas Monjalon, Timothy McDaniel,
	Gage Eads, Marcin Wojtas, Guy Tzalik, Ajit Khaparde,
	Harman Kalra, John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Jerin Jacob, Jan Viktorin,
	David Christensen, Ray Kinsella

On 29-Oct-20 9:27 PM, David Marchand wrote:
> On Tue, Oct 27, 2020 at 4:00 PM Liang Ma <liang.j.ma@intel.com> wrote:
>>
>> Currently, it is not possible to check support for intrinsics that
>> are platform-specific, cannot be abstracted in a generic way, or do not
>> have support on all architectures. The CPUID flags can be used to some
>> extent, but they are only defined for their platform, while intrinsics
>> will be available to all code as they are in generic headers.
>>
>> This patch introduces infrastructure to check support for certain
>> platform-specific intrinsics, and adds support for checking support for
>> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
>> Acked-by: Jerin Jacob <jerinj@marvell.com>
>> Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
> 
> Coming late to the party, it seems crowded...
> 
> 
> 
>> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
>> index 872f0ebe3e..28a5aecde8 100644
>> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
>> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
>> @@ -13,6 +13,32 @@
>>   #include "rte_common.h"
>>   #include <errno.h>
>>
>> +#include <rte_compat.h>
>> +
>> +/**
>> + * Structure used to describe platform-specific intrinsics that may or may not
>> + * be supported at runtime.
>> + */
>> +struct rte_cpu_intrinsics {
>> +       uint32_t power_monitor : 1;
>> +       /**< indicates support for rte_power_monitor function */
>> +       uint32_t power_pause : 1;
>> +       /**< indicates support for rte_power_pause function */
>> +};
> 
> - The rte_power library is supposed to be built on top of cpuflags.
> Not the other way around.
> Those capabilities should have been kept inside the rte_power_ API and
> not pollute the cpuflags API.
> 
> - All of this should have come as a single patch as the previously
> introduced API is unusable before.
> 
> 
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Check CPU support for various intrinsics at runtime.
>> + *
>> + * @param intrinsics
>> + *     Pointer to a structure to be filled.
>> + */
>> +__rte_experimental
>> +void
>> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
>> +
>>   /**
>>    * Enumeration of all CPU features supported
>>    */
>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>> index fb897d9060..03a326f076 100644
>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>> @@ -32,6 +32,10 @@
>>    * checked against the expected value, and if they match, the entering of
>>    * optimized power state may be aborted.
>>    *
>> + * @warning It is responsibility of the user to check if this function is
>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
>> + *   so may result in an illegal CPU instruction error.
>> + *
> 
> - Reading this API description... what am I supposed to do in my
> application or driver who wants to use the new
> rte_power_monitor/rte_power_pause stuff?
> 
> I should call rte_cpu_get_features(TOTO) ?
> This comment does not give a hint.
> 
> I suppose the intent was to refer to the rte_cpu_get_intrinsics_support() thing.
> This must be fixed.
> 
> 
> - Again, I wonder why we are exposing all this stuff.
> This should be hidden in the rte_power API.
> 
> 

We're exposing all of this here because the intrinsics are *not* part of 
the power API but rather are generic headers within EAL. Therefore, any 
infrastructure checking for their support can *not* reside in the power 
library, but instead has to be in EAL.

The intended usage here is to call this function before calling 
rte_power_monitor(), such that:

	struct rte_cpu_intrinsics intrinsics;

	rte_cpu_get_intrinsics_support(&intrinsics);

	if (!intrinsics.power_monitor) {
		// rte_power_monitor not supported and cannot be used
		return;
	}
	// proceed with rte_power_monitor usage

Failing to do that will result in either -ENOTSUP on non-x86, or illegal 
instruction crash on x86 that doesn't have that instruction (because we 
encode raw opcode).
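
The guard described above can be sketched as a self-contained fragment that builds without DPDK. Everything with a _sketch suffix is a hypothetical stand-in: the struct mirrors struct rte_cpu_intrinsics from the quoted patch, and the probe takes a have_waitpkg argument only so that both outcomes can be exercised without the hardware. A real caller would use rte_cpu_get_intrinsics_support() instead.

```c
#include <errno.h>
#include <stdint.h>

/* Local stand-in mirroring struct rte_cpu_intrinsics from the patch. */
struct intrinsics_sketch {
	uint32_t power_monitor : 1;
	uint32_t power_pause : 1;
};

/*
 * Hypothetical probe. A real application would call
 * rte_cpu_get_intrinsics_support() here; the argument simulates
 * whether the CPU reports the wait instructions.
 */
static void
probe_intrinsics_sketch(struct intrinsics_sketch *out, int have_waitpkg)
{
	out->power_monitor = have_waitpkg ? 1 : 0;
	out->power_pause = have_waitpkg ? 1 : 0;
}

/*
 * Returns 0 when the monitor path may be taken, -ENOTSUP when the
 * caller has to stay on its ordinary polling path.
 */
static int
try_power_monitor_sketch(int have_waitpkg)
{
	struct intrinsics_sketch caps;

	probe_intrinsics_sketch(&caps, have_waitpkg);
	if (!caps.power_monitor)
		return -ENOTSUP;

	/* rte_power_monitor() would be called at this point */
	return 0;
}
```

The design point is that the caller, not the intrinsic, decides what to do when support is missing, so the fallback stays under application control.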

I've *not* added this to the previous patches because i wanted to get 
this part reviewed specifically, and not mix it with other IA-specific 
stuff. It seems that i've succeeded in that goal, as this patch has 4 
likes^W acks :)

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-30 10:09                     ` Burakov, Anatoly
@ 2020-10-30 10:14                       ` Thomas Monjalon
  2020-10-30 13:37                         ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-30 10:14 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: David Marchand, Liang Ma, dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Timothy McDaniel, Gage Eads,
	Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Jerin Jacob, Jan Viktorin,
	David Christensen, Ray Kinsella

30/10/2020 11:09, Burakov, Anatoly:
> On 29-Oct-20 9:27 PM, David Marchand wrote:
> > On Tue, Oct 27, 2020 at 4:00 PM Liang Ma <liang.j.ma@intel.com> wrote:
> >>
> >> Currently, it is not possible to check support for intrinsics that
> >> are platform-specific, cannot be abstracted in a generic way, or do not
> >> have support on all architectures. The CPUID flags can be used to some
> >> extent, but they are only defined for their platform, while intrinsics
> >> will be available to all code as they are in generic headers.
> >>
> >> This patch introduces infrastructure to check support for certain
> >> platform-specific intrinsics, and adds support for checking support for
> >> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> >> Acked-by: Jerin Jacob <jerinj@marvell.com>
> >> Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >> Acked-by: Ray Kinsella <mdr@ashroe.eu>
> > 
> > Coming late to the party, it seems crowded...
> > 
> > 
> > 
> >> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
> >> index 872f0ebe3e..28a5aecde8 100644
> >> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
> >> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
> >> @@ -13,6 +13,32 @@
> >>   #include "rte_common.h"
> >>   #include <errno.h>
> >>
> >> +#include <rte_compat.h>
> >> +
> >> +/**
> >> + * Structure used to describe platform-specific intrinsics that may or may not
> >> + * be supported at runtime.
> >> + */
> >> +struct rte_cpu_intrinsics {
> >> +       uint32_t power_monitor : 1;
> >> +       /**< indicates support for rte_power_monitor function */
> >> +       uint32_t power_pause : 1;
> >> +       /**< indicates support for rte_power_pause function */
> >> +};
> > 
> > - The rte_power library is supposed to be built on top of cpuflags.
> > Not the other way around.
> > Those capabilities should have been kept inside the rte_power_ API and
> > not pollute the cpuflags API.
> > 
> > - All of this should have come as a single patch as the previously
> > introduced API is unusable before.
> > 
> > 
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Check CPU support for various intrinsics at runtime.
> >> + *
> >> + * @param intrinsics
> >> + *     Pointer to a structure to be filled.
> >> + */
> >> +__rte_experimental
> >> +void
> >> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
> >> +
> >>   /**
> >>    * Enumeration of all CPU features supported
> >>    */
> >> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >> index fb897d9060..03a326f076 100644
> >> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >> @@ -32,6 +32,10 @@
> >>    * checked against the expected value, and if they match, the entering of
> >>    * optimized power state may be aborted.
> >>    *
> >> + * @warning It is responsibility of the user to check if this function is
> >> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> >> + *   so may result in an illegal CPU instruction error.
> >> + *
> > 
> > - Reading this API description... what am I supposed to do in my
> > application or driver who wants to use the new
> > rte_power_monitor/rte_power_pause stuff?
> > 
> > I should call rte_cpu_get_features(TOTO) ?
> > This comment does not give a hint.
> > 
> > I suppose the intent was to refer to the rte_cpu_get_intrinsics_support() thing.
> > This must be fixed.
> > 
> > 
> > - Again, I wonder why we are exposing all this stuff.
> > This should be hidden in the rte_power API.
> > 
> 
> We're exposing all of this here because the intrinsics are *not* part of 
> the power API but rather are generic headers within EAL. Therefore, any 
> infrastructure checking for their support can *not* reside in the power 
> library, but instead has to be in EAL.
> 
> The intended usage here is to call this function before calling 
> rte_power_monitor(), such that:
> 
> 	struct rte_cpu_intrinsics intrinsics;
> 
> 	rte_cpu_get_intrinsics_support(&intrinsics);
> 
> 	if (!intrinsics.power_monitor) {
> 		// rte_power_monitor not supported and cannot be used
> 		return;
> 	}

This check could be done inside the rte_power API.

> 	// proceed with rte_power_monitor usage
> 
> Failing to do that will result in either -ENOTSUP on non-x86, or illegal 
> instruction crash on x86 that doesn't have that instruction (because we 
> encode raw opcode).
> 
> I've *not* added this to the previous patches because i wanted to get 
> this part reviewed specifically, and not mix it with other IA-specific 
> stuff. It seems that i've succeeded in that goal, as this patch has 4 
> likes^W acks :)

You did not explain the need for rte_cpu_get_features() call.





* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-30 10:14                       ` Thomas Monjalon
@ 2020-10-30 13:37                         ` Burakov, Anatoly
  2020-10-30 14:09                           ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-30 13:37 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: David Marchand, Liang Ma, dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Timothy McDaniel, Gage Eads,
	Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Jerin Jacob, Jan Viktorin,
	David Christensen, Ray Kinsella

On 30-Oct-20 10:14 AM, Thomas Monjalon wrote:
> 30/10/2020 11:09, Burakov, Anatoly:
>> On 29-Oct-20 9:27 PM, David Marchand wrote:
>>> On Tue, Oct 27, 2020 at 4:00 PM Liang Ma <liang.j.ma@intel.com> wrote:
>>>>
>>>> Currently, it is not possible to check support for intrinsics that
>>>> are platform-specific, cannot be abstracted in a generic way, or do not
>>>> have support on all architectures. The CPUID flags can be used to some
>>>> extent, but they are only defined for their platform, while intrinsics
>>>> will be available to all code as they are in generic headers.
>>>>
>>>> This patch introduces infrastructure to check support for certain
>>>> platform-specific intrinsics, and adds support for checking support for
>>>> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>>>>
>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>>>> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
>>>> Acked-by: Jerin Jacob <jerinj@marvell.com>
>>>> Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
>>>
>>> Coming late to the party, it seems crowded...
>>>
>>>
>>>
>>>> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
>>>> index 872f0ebe3e..28a5aecde8 100644
>>>> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
>>>> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
>>>> @@ -13,6 +13,32 @@
>>>>    #include "rte_common.h"
>>>>    #include <errno.h>
>>>>
>>>> +#include <rte_compat.h>
>>>> +
>>>> +/**
>>>> + * Structure used to describe platform-specific intrinsics that may or may not
>>>> + * be supported at runtime.
>>>> + */
>>>> +struct rte_cpu_intrinsics {
>>>> +       uint32_t power_monitor : 1;
>>>> +       /**< indicates support for rte_power_monitor function */
>>>> +       uint32_t power_pause : 1;
>>>> +       /**< indicates support for rte_power_pause function */
>>>> +};
>>>
>>> - The rte_power library is supposed to be built on top of cpuflags.
>>> Not the other way around.
>>> Those capabilities should have been kept inside the rte_power_ API and
>>> not pollute the cpuflags API.
>>>
>>> - All of this should have come as a single patch as the previously
>>> introduced API is unusable before.
>>>
>>>
>>>> +
>>>> +/**
>>>> + * @warning
>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>> + *
>>>> + * Check CPU support for various intrinsics at runtime.
>>>> + *
>>>> + * @param intrinsics
>>>> + *     Pointer to a structure to be filled.
>>>> + */
>>>> +__rte_experimental
>>>> +void
>>>> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
>>>> +
>>>>    /**
>>>>     * Enumeration of all CPU features supported
>>>>     */
>>>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>> index fb897d9060..03a326f076 100644
>>>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>> @@ -32,6 +32,10 @@
>>>>     * checked against the expected value, and if they match, the entering of
>>>>     * optimized power state may be aborted.
>>>>     *
>>>> + * @warning It is responsibility of the user to check if this function is
>>>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
>>>> + *   so may result in an illegal CPU instruction error.
>>>> + *
>>>
>>> - Reading this API description... what am I supposed to do in my
>>> application or driver who wants to use the new
>>> rte_power_monitor/rte_power_pause stuff?
>>>
>>> I should call rte_cpu_get_features(TOTO) ?
>>> This comment does not give a hint.
>>>
>>> I suppose the intent was to refer to the rte_cpu_get_intrinsics_support() thing.
>>> This must be fixed.
>>>
>>>
>>> - Again, I wonder why we are exposing all this stuff.
>>> This should be hidden in the rte_power API.
>>>
>>
>> We're exposing all of this here because the intrinsics are *not* part of
>> the power API but rather are generic headers within EAL. Therefore, any
>> infrastructure checking for their support can *not* reside in the power
>> library, but instead has to be in EAL.
>>
>> The intended usage here is to call this function before calling
>> rte_power_monitor(), such that:
>>
>> 	struct rte_cpu_intrinsics intrinsics;
>>
>> 	rte_cpu_get_intrinsics_support(&intrinsics);
>>
>> 	if (!intrinsics.power_monitor) {
>> 		// rte_power_monitor not supported and cannot be used
>> 		return;
>> 	}
> 
> This check could be done inside the rte_power API.

I'm not quite clear on exactly what you're asking here.

Do you mean the example code above? If so, code like that is already 
present in the power library, at the callback enable stage.

If you mean to say, i should put this check into the rte_power_monitor 
intrinsic, then no, i don't think it's a good idea to have this 
expensive check every time you call rte_power_monitor.

If you mean put this entire infrastructure into the power API - well, 
that kinda defeats the purpose of both having these intrinsics in 
generic headers and having a generic CPU feature check infrastructure 
that was requested of us during the review. We of course can move the 
intrinsic to the power library and outside of EAL, but then anything 
that requires UMWAIT will have to depend on the librte_power.

Please clarify exactly what changes you would like to see here, and what 
is your objection.

> 
>> 	// proceed with rte_power_monitor usage
>>
>> Failing to do that will result in either -ENOTSUP on non-x86, or illegal
>> instruction crash on x86 that doesn't have that instruction (because we
>> encode raw opcode).
>>
>> I've *not* added this to the previous patches because i wanted to get
>> this part reviewed specifically, and not mix it with other IA-specific
>> stuff. It seems that i've succeeded in that goal, as this patch has 4
>> likes^W acks :)
> 
> You did not explain the need for rte_cpu_get_features() call.
> 

Did not explain *where*? Are you suggesting i put things about 
rte_power_monitor into documentation for rte_cpu_get_intrinsics_support? 
The documentation for rte_power_monitor already states that one should 
use rte_cpu_get_intrinsics_support API to check if the rte_power_monitor 
is supported on current machine. What else is missing?

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-30 13:37                         ` Burakov, Anatoly
@ 2020-10-30 14:09                           ` Thomas Monjalon
  2020-10-30 15:27                             ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-30 14:09 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: David Marchand, Liang Ma, dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Timothy McDaniel, Gage Eads,
	Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Jerin Jacob, Jan Viktorin,
	David Christensen, Ray Kinsella

30/10/2020 14:37, Burakov, Anatoly:
> On 30-Oct-20 10:14 AM, Thomas Monjalon wrote:
> > 30/10/2020 11:09, Burakov, Anatoly:
> >> On 29-Oct-20 9:27 PM, David Marchand wrote:
> >>> On Tue, Oct 27, 2020 at 4:00 PM Liang Ma <liang.j.ma@intel.com> wrote:
> >>>>
> >>>> Currently, it is not possible to check support for intrinsics that
> >>>> are platform-specific, cannot be abstracted in a generic way, or do not
> >>>> have support on all architectures. The CPUID flags can be used to some
> >>>> extent, but they are only defined for their platform, while intrinsics
> >>>> will be available to all code as they are in generic headers.
> >>>>
> >>>> This patch introduces infrastructure to check support for certain
> >>>> platform-specific intrinsics, and adds support for checking support for
> >>>> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
> >>>>
> >>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >>>> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> >>>> Acked-by: Jerin Jacob <jerinj@marvell.com>
> >>>> Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
> >>>
> >>> Coming late to the party, it seems crowded...
> >>>
> >>>
> >>>
> >>>> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
> >>>> index 872f0ebe3e..28a5aecde8 100644
> >>>> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
> >>>> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
> >>>> @@ -13,6 +13,32 @@
> >>>>    #include "rte_common.h"
> >>>>    #include <errno.h>
> >>>>
> >>>> +#include <rte_compat.h>
> >>>> +
> >>>> +/**
> >>>> + * Structure used to describe platform-specific intrinsics that may or may not
> >>>> + * be supported at runtime.
> >>>> + */
> >>>> +struct rte_cpu_intrinsics {
> >>>> +       uint32_t power_monitor : 1;
> >>>> +       /**< indicates support for rte_power_monitor function */
> >>>> +       uint32_t power_pause : 1;
> >>>> +       /**< indicates support for rte_power_pause function */
> >>>> +};
> >>>
> >>> - The rte_power library is supposed to be built on top of cpuflags.
> >>> Not the other way around.
> >>> Those capabilities should have been kept inside the rte_power_ API and
> >>> not pollute the cpuflags API.
> >>>
> >>> - All of this should have come as a single patch as the previously
> >>> introduced API is unusable before.
> >>>
> >>>
> >>>> +
> >>>> +/**
> >>>> + * @warning
> >>>> + * @b EXPERIMENTAL: this API may change without prior notice
> >>>> + *
> >>>> + * Check CPU support for various intrinsics at runtime.
> >>>> + *
> >>>> + * @param intrinsics
> >>>> + *     Pointer to a structure to be filled.
> >>>> + */
> >>>> +__rte_experimental
> >>>> +void
> >>>> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
> >>>> +
> >>>>    /**
> >>>>     * Enumeration of all CPU features supported
> >>>>     */
> >>>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >>>> index fb897d9060..03a326f076 100644
> >>>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >>>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >>>> @@ -32,6 +32,10 @@
> >>>>     * checked against the expected value, and if they match, the entering of
> >>>>     * optimized power state may be aborted.
> >>>>     *
> >>>> + * @warning It is responsibility of the user to check if this function is
> >>>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> >>>> + *   so may result in an illegal CPU instruction error.
> >>>> + *
> >>>
> >>> - Reading this API description... what am I supposed to do in my
> >>> application or driver who wants to use the new
> >>> rte_power_monitor/rte_power_pause stuff?
> >>>
> >>> I should call rte_cpu_get_features(TOTO) ?
> >>> This comment does not give a hint.
> >>>
> >>> I suppose the intent was to refer to the rte_cpu_get_intrinsics_support() thing.
> >>> This must be fixed.
> >>>
> >>>
> >>> - Again, I wonder why we are exposing all this stuff.
> >>> This should be hidden in the rte_power API.
> >>>
> >>
> >> We're exposing all of this here because the intrinsics are *not* part of
> >> the power API but rather are generic headers within EAL. Therefore, any
> >> infrastructure checking for their support can *not* reside in the power
> >> library, but instead has to be in EAL.
> >>
> >> The intended usage here is to call this function before calling
> >> rte_power_monitor(), such that:
> >>
> >> 	struct rte_cpu_intrinsics intrinsics;
> >>
> >> 	rte_cpu_get_intrinsics_support(&intrinsics);
> >>
> >> 	if (!intrinsics.power_monitor) {
> >> 		// rte_power_monitor not supported and cannot be used
> >> 		return;
> >> 	}
> > 
> > This check could be done inside the rte_power API.
> 
> I'm not quite clear on exactly what you're asking here.
> 
> Do you mean the example code above? If so, code like that is already 
> present in the power library, at the callback enable stage.
> 
> If you mean to say, i should put this check into the rte_power_monitor 
> intrinsic, then no, i don't think it's a good idea to have this 
> expensive check every time you call rte_power_monitor.

No but it can be done at initialization time.
According to what you say above, it is already done at callback enable stage.
So the app does not need to do the check?

> If you mean put this entire infrastructure into the power API - well, 
> that kinda defeats the purpose of both having these intrinsics in 
> generic headers and having a generic CPU feature check infrastructure 
> that was requested of us during the review. We of course can move the 
> intrinsic to the power library and outside of EAL, but then anything 
> that requires UMWAIT will have to depend on the librte_power.

Yes the intrinsics can be in EAL if usable.
But it seems DLB author cannot use what is in EAL.

> Please clarify exactly what changes you would like to see here, and what 
> is your objection.
> 
> > 
> >> 	// proceed with rte_power_monitor usage
> >>
> >> Failing to do that will result in either -ENOTSUP on non-x86, or illegal
> >> instruction crash on x86 that doesn't have that instruction (because we
> >> encode raw opcode).
> >>
> >> I've *not* added this to the previous patches because i wanted to get
> >> this part reviewed specifically, and not mix it with other IA-specific
> >> stuff. It seems that i've succeeded in that goal, as this patch has 4
> >> likes^W acks :)
> > 
> > You did not explain the need for rte_cpu_get_features() call.
> > 
> 
> Did not explain *where*? Are you suggesting i put things about 
> rte_power_monitor into documentation for rte_cpu_get_intrinsics_support? 
> The documentation for rte_power_monitor already states that one should 
> use rte_cpu_get_intrinsics_support API to check if the rte_power_monitor 
> is supported on current machine. What else is missing?

In your example above, you do not call rte_cpu_get_features()
which is documented as required in the EAL doc.





* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-30 14:09                           ` Thomas Monjalon
@ 2020-10-30 15:27                             ` Burakov, Anatoly
  2020-10-30 15:44                               ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-30 15:27 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: David Marchand, Liang Ma, dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Timothy McDaniel, Gage Eads,
	Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Jerin Jacob, Jan Viktorin,
	David Christensen, Ray Kinsella

On 30-Oct-20 2:09 PM, Thomas Monjalon wrote:
> 30/10/2020 14:37, Burakov, Anatoly:
>> On 30-Oct-20 10:14 AM, Thomas Monjalon wrote:
>>> 30/10/2020 11:09, Burakov, Anatoly:
>>>> On 29-Oct-20 9:27 PM, David Marchand wrote:
>>>>> On Tue, Oct 27, 2020 at 4:00 PM Liang Ma <liang.j.ma@intel.com> wrote:
>>>>>>
>>>>>> Currently, it is not possible to check support for intrinsics that
>>>>>> are platform-specific, cannot be abstracted in a generic way, or do not
>>>>>> have support on all architectures. The CPUID flags can be used to some
>>>>>> extent, but they are only defined for their platform, while intrinsics
>>>>>> will be available to all code as they are in generic headers.
>>>>>>
>>>>>> This patch introduces infrastructure to check support for certain
>>>>>> platform-specific intrinsics, and adds support for checking support for
>>>>>> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>>>>>>
>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>>>>>> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
>>>>>> Acked-by: Jerin Jacob <jerinj@marvell.com>
>>>>>> Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
>>>>>
>>>>> Coming late to the party, it seems crowded...
>>>>>
>>>>>
>>>>>
>>>>>> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
>>>>>> index 872f0ebe3e..28a5aecde8 100644
>>>>>> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
>>>>>> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
>>>>>> @@ -13,6 +13,32 @@
>>>>>>     #include "rte_common.h"
>>>>>>     #include <errno.h>
>>>>>>
>>>>>> +#include <rte_compat.h>
>>>>>> +
>>>>>> +/**
>>>>>> + * Structure used to describe platform-specific intrinsics that may or may not
>>>>>> + * be supported at runtime.
>>>>>> + */
>>>>>> +struct rte_cpu_intrinsics {
>>>>>> +       uint32_t power_monitor : 1;
>>>>>> +       /**< indicates support for rte_power_monitor function */
>>>>>> +       uint32_t power_pause : 1;
>>>>>> +       /**< indicates support for rte_power_pause function */
>>>>>> +};
>>>>>
>>>>> - The rte_power library is supposed to be built on top of cpuflags.
>>>>> Not the other way around.
>>>>> Those capabilities should have been kept inside the rte_power_ API and
>>>>> not pollute the cpuflags API.
>>>>>
>>>>> - All of this should have come as a single patch as the previously
>>>>> introduced API is unusable before.
>>>>>
>>>>>
>>>>>> +
>>>>>> +/**
>>>>>> + * @warning
>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>> + *
>>>>>> + * Check CPU support for various intrinsics at runtime.
>>>>>> + *
>>>>>> + * @param intrinsics
>>>>>> + *     Pointer to a structure to be filled.
>>>>>> + */
>>>>>> +__rte_experimental
>>>>>> +void
>>>>>> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
>>>>>> +
>>>>>>     /**
>>>>>>      * Enumeration of all CPU features supported
>>>>>>      */
>>>>>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>>>> index fb897d9060..03a326f076 100644
>>>>>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>>>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>>>> @@ -32,6 +32,10 @@
>>>>>>      * checked against the expected value, and if they match, the entering of
>>>>>>      * optimized power state may be aborted.
>>>>>>      *
>>>>>> + * @warning It is responsibility of the user to check if this function is
>>>>>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
>>>>>> + *   so may result in an illegal CPU instruction error.
>>>>>> + *
>>>>>
>>>>> - Reading this API description... what am I supposed to do in my
>>>>> application or driver who wants to use the new
>>>>> rte_power_monitor/rte_power_pause stuff?
>>>>>
>>>>> I should call rte_cpu_get_features(TOTO) ?
>>>>> This comment does not give a hint.
>>>>>
>>>>> I suppose the intent was to refer to the rte_cpu_get_intrinsics_support() thing.
>>>>> This must be fixed.
>>>>>
>>>>>
>>>>> - Again, I wonder why we are exposing all this stuff.
>>>>> This should be hidden in the rte_power API.
>>>>>
>>>>
>>>> We're exposing all of this here because the intrinsics are *not* part of
>>>> the power API but rather are generic headers within EAL. Therefore, any
>>>> infrastructure checking for their support can *not* reside in the power
>>>> library, but instead has to be in EAL.
>>>>
>>>> The intended usage here is to call this function before calling
>>>> rte_power_monitor(), such that:
>>>>
>>>> 	struct rte_cpu_intrinsics intrinsics;
>>>>
>>>> 	rte_cpu_get_intrinsics_support(&intrinsics);
>>>>
>>>> 	if (!intrinsics.power_monitor) {
>>>> 		// rte_power_monitor not supported and cannot be used
>>>> 		return;
>>>> 	}
>>>
>>> This check could be done inside the rte_power API.
>>
>> I'm not quite clear on exactly what you're asking here.
>>
>> Do you mean the example code above? If so, code like that is already
>> present in the power library, at the callback enable stage.
>>
>> If you mean to say, i should put this check into the rte_power_monitor
>> intrinsic, then no, i don't think it's a good idea to have this
>> expensive check every time you call rte_power_monitor.
> 
> No but it can be done at initialization time.
> According to what you say above, it is already done at callback enable stage.
> So the app does not need to do the check?

Admittedly it's a bit confusing, but please bear with me.

There are two separate issues at hand: the intrinsic itself, and the 
calling code. We provide both.

That means, the *calling code* should do the check. In our case, *our* 
calling code is the callback. However, nothing stops someone else from 
implementing their own scheme using our intrinsic - in that case, the 
user will be responsible for checking that the intrinsic is supported before
using it in their own code, because they won't be using our callback but 
will be using our intrinsic.

So, we have a check *in our calling code*. But if someone were to use 
the *intrinsic* directly (like DLB), they would have to add their own 
checks around the intrinsic usage.

Our power intrinsic is a static inline function. Are you proposing to 
add some sort of function pointer wrapper and make it an indirect call 
instead of a static inline function? (or indeed a proper function)
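
The split described above (the check lives in the calling code, not in
the intrinsic) can be sketched roughly as follows. This is only a
minimal illustration, not the actual power library code: the struct and
rte_cpu_get_intrinsics_support() are local stand-ins that mirror the
declarations in this patch so the snippet builds without DPDK, and
fake_has_waitpkg, my_init() and my_wait() are hypothetical names made up
for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in mirroring the struct added by this patch, so the
 * sketch builds without DPDK headers. */
struct rte_cpu_intrinsics {
	uint32_t power_monitor : 1;
	uint32_t power_pause : 1;
};

/* Hypothetical knob that fakes the platform for this sketch. */
static int fake_has_waitpkg;

/* Stand-in for the EAL call; same signature as in the patch. */
static void
rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
{
	assert(intrinsics != NULL);
	intrinsics->power_monitor = fake_has_waitpkg ? 1 : 0;
	intrinsics->power_pause = fake_has_waitpkg ? 1 : 0;
}

/* Cached once at init so the hot path never repeats the check. */
static int monitor_supported;

/* Hypothetical init hook of a caller that uses the intrinsic directly. */
static int
my_init(void)
{
	struct rte_cpu_intrinsics intrinsics;

	rte_cpu_get_intrinsics_support(&intrinsics);
	monitor_supported = intrinsics.power_monitor;

	return monitor_supported ? 0 : -ENOTSUP;
}

/* Hypothetical hot-path helper: returns 1 if it would sleep using
 * rte_power_monitor(), 0 if the caller should keep busy-polling. */
static int
my_wait(void)
{
	if (!monitor_supported)
		return 0;
	/* rte_power_monitor(...) would be called here */
	return 1;
}
```

The point of caching the result is that the intrinsic itself can stay a
static inline function with no check inside it; only the one-time setup
path pays for the support query.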

> 
>> If you mean put this entire infrastructure into the power API - well,
>> that kinda defeats the purpose of both having these intrinsics in
>> generic headers and having a generic CPU feature check infrastructure
>> that was requested of us during the review. We of course can move the
>> intrinsic to the power library and outside of EAL, but then anything
>> that requires UMWAIT will have to depend on the librte_power.
> 
> Yes the intrinsics can be in EAL if usable.
> But it seems DLB author cannot use what is in EAL.

I'll let the DLB authors clarify that themselves, but as far as i'm 
aware, it seems that this is not the case - while their current code 
wouldn't be able to use these intrinsics by search-and-replace, they 
will be able to use them with a couple of changes to their code, which
basically amounts to a reimplementation of our intrinsics.

> 
>> Please clarify exactly what changes you would like to see here, and what
>> is your objection.
>>
>>>
>>>> 	// proceed with rte_power_monitor usage
>>>>
>>>> Failing to do that will result in either -ENOTSUP on non-x86, or illegal
>>>> instruction crash on x86 that doesn't have that instruction (because we
>>>> encode raw opcode).
>>>>
>>>> I've *not* added this to the previous patches because i wanted to get
>>>> this part reviewed specifically, and not mix it with other IA-specific
>>>> stuff. It seems that i've succeeded in that goal, as this patch has 4
>>>> likes^W acks :)
>>>
>>> You did not explain the need for rte_cpu_get_features() call.
>>>
>>
>> Did not explain *where*? Are you suggesting i put things about
>> rte_power_monitor into documentation for rte_cpu_get_intrinsics_support?
>> The documentation for rte_power_monitor already states that one should
>> use rte_cpu_get_intrinsics_support API to check if the rte_power_monitor
>> is supported on current machine. What else is missing?
> 
> In your example above, you do not call rte_cpu_get_features()
> which is documented as required in the EAL doc.
> 

I'm not sure i follow. This is unrelated to rte_cpu_get_features call. 
The rte_cpu_get_features is a CPUID check, and it was decided not to use 
it because the WAITPKG CPUID flag is only defined for x86 and not for 
other archs. This new call (rte_cpu_get_intrinsics_support) is not
arch-specific, but will have an arch-specific implementation (which happens
to use rte_cpu_get_features to detect support for WAITPKG). I have given 
the example code of how to detect support for rte_power_monitor using 
this new code, in the code example you just referred to.
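
To illustrate the "generic call, arch-specific implementation" idea, a
rough sketch is below. It is an assumption about the shape of such
implementations, not a copy of the actual EAL sources: the struct is a
local stand-in mirroring the one in this patch, fake_waitpkg_flag is a
hypothetical replacement for the x86-only WAITPKG CPUID flag lookup, and
the two sketch_* functions are made-up names standing for what each
architecture's version of rte_cpu_get_intrinsics_support() might do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring the struct added by this patch. */
struct rte_cpu_intrinsics {
	uint32_t power_monitor : 1;
	uint32_t power_pause : 1;
};

/* Hypothetical stand-in for the x86-only WAITPKG CPUID flag lookup. */
static int fake_waitpkg_flag;

/* What an x86 implementation might do: translate the arch-specific
 * CPUID flag into the generic capability bits. */
static void
sketch_x86_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
{
	assert(intrinsics != NULL);
	memset(intrinsics, 0, sizeof(*intrinsics));

	if (fake_waitpkg_flag) {
		intrinsics->power_monitor = 1;
		intrinsics->power_pause = 1;
	}
}

/* What an arch without comparable instructions might do: report that
 * none of the intrinsics are supported. */
static void
sketch_other_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
{
	assert(intrinsics != NULL);
	memset(intrinsics, 0, sizeof(*intrinsics));
}
```

Either way the caller only ever looks at the generic bits, so it never
needs to know that WAITPKG exists.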

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-30 15:27                             ` Burakov, Anatoly
@ 2020-10-30 15:44                               ` Thomas Monjalon
  2020-10-30 16:36                                 ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-30 15:44 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: David Marchand, Liang Ma, dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Timothy McDaniel, Gage Eads,
	Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Jerin Jacob, Jan Viktorin,
	David Christensen, Ray Kinsella

30/10/2020 16:27, Burakov, Anatoly:
> On 30-Oct-20 2:09 PM, Thomas Monjalon wrote:
> > 30/10/2020 14:37, Burakov, Anatoly:
> >> On 30-Oct-20 10:14 AM, Thomas Monjalon wrote:
> >>> 30/10/2020 11:09, Burakov, Anatoly:
> >>>> On 29-Oct-20 9:27 PM, David Marchand wrote:
> >>>>> On Tue, Oct 27, 2020 at 4:00 PM Liang Ma <liang.j.ma@intel.com> wrote:
> >>>>>>
> >>>>>> Currently, it is not possible to check support for intrinsics that
> >>>>>> are platform-specific, cannot be abstracted in a generic way, or do not
> >>>>>> have support on all architectures. The CPUID flags can be used to some
> >>>>>> extent, but they are only defined for their platform, while intrinsics
> >>>>>> will be available to all code as they are in generic headers.
> >>>>>>
> >>>>>> This patch introduces infrastructure to check support for certain
> >>>>>> platform-specific intrinsics, and adds support for checking support for
> >>>>>> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
> >>>>>>
> >>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >>>>>> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
> >>>>>> Acked-by: Jerin Jacob <jerinj@marvell.com>
> >>>>>> Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>>>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
> >>>>>
> >>>>> Coming late to the party, it seems crowded...
> >>>>>
> >>>>>
> >>>>>
> >>>>>> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
> >>>>>> index 872f0ebe3e..28a5aecde8 100644
> >>>>>> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
> >>>>>> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
> >>>>>> @@ -13,6 +13,32 @@
> >>>>>>     #include "rte_common.h"
> >>>>>>     #include <errno.h>
> >>>>>>
> >>>>>> +#include <rte_compat.h>
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * Structure used to describe platform-specific intrinsics that may or may not
> >>>>>> + * be supported at runtime.
> >>>>>> + */
> >>>>>> +struct rte_cpu_intrinsics {
> >>>>>> +       uint32_t power_monitor : 1;
> >>>>>> +       /**< indicates support for rte_power_monitor function */
> >>>>>> +       uint32_t power_pause : 1;
> >>>>>> +       /**< indicates support for rte_power_pause function */
> >>>>>> +};
> >>>>>
> >>>>> - The rte_power library is supposed to be built on top of cpuflags.
> >>>>> Not the other way around.
> >>>>> Those capabilities should have been kept inside the rte_power_ API and
> >>>>> not pollute the cpuflags API.
> >>>>>
> >>>>> - All of this should have come as a single patch as the previously
> >>>>> introduced API is unusable before.
> >>>>>
> >>>>>
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * @warning
> >>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
> >>>>>> + *
> >>>>>> + * Check CPU support for various intrinsics at runtime.
> >>>>>> + *
> >>>>>> + * @param intrinsics
> >>>>>> + *     Pointer to a structure to be filled.
> >>>>>> + */
> >>>>>> +__rte_experimental
> >>>>>> +void
> >>>>>> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
> >>>>>> +
> >>>>>>     /**
> >>>>>>      * Enumeration of all CPU features supported
> >>>>>>      */
> >>>>>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >>>>>> index fb897d9060..03a326f076 100644
> >>>>>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >>>>>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> >>>>>> @@ -32,6 +32,10 @@
> >>>>>>      * checked against the expected value, and if they match, the entering of
> >>>>>>      * optimized power state may be aborted.
> >>>>>>      *
> >>>>>> + * @warning It is responsibility of the user to check if this function is
> >>>>>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> >>>>>> + *   so may result in an illegal CPU instruction error.
> >>>>>> + *
> >>>>>
> >>>>> - Reading this API description... what am I supposed to do in my
> >>>>> application or driver who wants to use the new
> >>>>> rte_power_monitor/rte_power_pause stuff?
> >>>>>
> >>>>> I should call rte_cpu_get_features(TOTO) ?
> >>>>> This comment does not give a hint.
> >>>>>
> >>>>> I suppose the intent was to refer to the rte_cpu_get_intrinsics_support() thing.
> >>>>> This must be fixed.
> >>>>>
> >>>>>
> >>>>> - Again, I wonder why we are exposing all this stuff.
> >>>>> This should be hidden in the rte_power API.
> >>>>>
> >>>>
> >>>> We're exposing all of this here because the intrinsics are *not* part of
> >>>> the power API but rather are generic headers within EAL. Therefore, any
> >>>> infrastructure checking for their support can *not* reside in the power
> >>>> library, but instead has to be in EAL.
> >>>>
> >>>> The intended usage here is to call this function before calling
> >>>> rte_power_monitor(), such that:
> >>>>
> >>>> 	struct rte_cpu_intrinsics intrinsics;
> >>>>
> >>>> 	rte_cpu_get_intrinsics_support(&intrinsics);
> >>>>
> >>>> 	if (!intrinsics.power_monitor) {
> >>>> 		// rte_power_monitor not supported and cannot be used
> >>>> 		return;
> >>>> 	}
> >>>
> >>> This check could be done inside the rte_power API.
> >>
> >> I'm not quite clear on exactly what you're asking here.
> >>
> >> Do you mean the example code above? If so, code like that is already
> >> present in the power library, at the callback enable stage.
> >>
> >> If you mean to say, i should put this check into the rte_power_monitor
> >> intrinsic, then no, i don't think it's a good idea to have this
> >> expensive check every time you call rte_power_monitor.
> > 
> > No but it can be done at initialization time.
> > > According to what you say above, it is already done at callback enable stage.
> > So the app does not need to do the check?
> 
> Admittedly it's a bit confusing, but please bear with me.
> 
> There are two separate issues at hand: the intrinsic itself, and the 
> calling code. We provide both.
> 
> That means, the *calling code* should do the check. In our case, *our* 
> calling code is the callback. However, nothing stops someone else from 
> implementing their own scheme using our intrinsic - in that case, the 
> user will be responsible to check if the intrinsic is supported before 
> using it in their own code, because they won't be using our callback but 
> will be using our intrinsic.
> 
> So, we have a check *in our calling code*. But if someone were to use 
> the *intrinsic* directly (like DLB), they would have to add their own 
> checks around the intrinsic usage.
> 
> Our power intrinsic is a static inline function. Are you proposing to 
> add some sort of function pointer wrapper and make it an indirect call 
> instead of a static inline function? (or indeed a proper function)
> 
> > 
> >> If you mean put this entire infrastructure into the power API - well,
> >> that kinda defeats the purpose of both having these intrinsics in
> >> generic headers and having a generic CPU feature check infrastructure
> >> that was requested of us during the review. We of course can move the
> >> intrinsic to the power library and outside of EAL, but then anything
> >> that requires UMWAIT will have to depend on the librte_power.
> > 
> > Yes the intrinsics can be in EAL if usable.
> > But it seems DLB author cannot use what is in EAL.
> 
> I'll let the DLB authors clarify that themselves, but as far as i'm 
> aware, it seems that this is not the case - while their current code 
> wouldn't be able to use these intrinsics by search-and-replace, they 
> will be able to use them with a couple of changes to their code that 
> basically amounted to reimplementation of our intrinsics.
> 
> > 
> >> Please clarify exactly what changes you would like to see here, and what
> >> is your objection.
> >>
> >>>
> >>>> 	// proceed with rte_power_monitor usage
> >>>>
> >>>> Failing to do that will result in either -ENOTSUP on non-x86, or illegal
> >>>> instruction crash on x86 that doesn't have that instruction (because we
> >>>> encode raw opcode).
> >>>>
> >>>> I've *not* added this to the previous patches because i wanted to get
> >>>> this part reviewed specifically, and not mix it with other IA-specific
> >>>> stuff. It seems that i've succeeded in that goal, as this patch has 4
> >>>> likes^W acks :)
> >>>
> >>> You did not explain the need for rte_cpu_get_features() call.
> >>>
> >>
> >> Did not explain *where*? Are you suggesting i put things about
> >> rte_power_monitor into documentation for rte_cpu_get_intrinsics_support?
> >> The documentation for rte_power_monitor already states that one should
> >> use rte_cpu_get_intrinsics_support API to check if the rte_power_monitor
> >> is supported on current machine. What else is missing?
> > 
> > In your example above, you do not call rte_cpu_get_features()
> > which is documented as required in the EAL doc.
> > 
> 
> I'm not sure i follow. This is unrelated to rte_cpu_get_features call. 
> The rte_cpu_get_features is a CPUID check, and it was decided not to use 
> it because the WAITPKG CPUID flag is only defined for x86 and not for 
> other archs. This new call (rte_cpu_get_intrinsics_support) is non-arch 
> specific, but will have an arch-specific implementation (which happens 
> to use rte_cpu_get_features to detect support for WAITPKG). I have given 
> the example code of how to detect support for rte_power_monitor using 
> this new code, in the code example you just referred to.

Please read the API again:
http://git.dpdk.org/dpdk/tree/lib/librte_eal/include/generic/rte_power_intrinsics.h
"
 * @warning It is responsibility of the user to check if this function is
 *   supported at runtime using `rte_cpu_get_features()` API call.
 *   Failing to do so may result in an illegal CPU instruction error.
"
Why is it referring to rte_cpu_get_features?




* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-30 15:44                               ` Thomas Monjalon
@ 2020-10-30 16:36                                 ` Burakov, Anatoly
  2020-10-30 16:50                                   ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-30 16:36 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: David Marchand, Liang Ma, dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Timothy McDaniel, Gage Eads,
	Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Jerin Jacob, Jan Viktorin,
	David Christensen, Ray Kinsella

On 30-Oct-20 3:44 PM, Thomas Monjalon wrote:
> 30/10/2020 16:27, Burakov, Anatoly:
>> On 30-Oct-20 2:09 PM, Thomas Monjalon wrote:
>>> 30/10/2020 14:37, Burakov, Anatoly:
>>>> On 30-Oct-20 10:14 AM, Thomas Monjalon wrote:
>>>>> 30/10/2020 11:09, Burakov, Anatoly:
>>>>>> On 29-Oct-20 9:27 PM, David Marchand wrote:
>>>>>>> On Tue, Oct 27, 2020 at 4:00 PM Liang Ma <liang.j.ma@intel.com> wrote:
>>>>>>>>
>>>>>>>> Currently, it is not possible to check support for intrinsics that
>>>>>>>> are platform-specific, cannot be abstracted in a generic way, or do not
>>>>>>>> have support on all architectures. The CPUID flags can be used to some
>>>>>>>> extent, but they are only defined for their platform, while intrinsics
>>>>>>>> will be available to all code as they are in generic headers.
>>>>>>>>
>>>>>>>> This patch introduces infrastructure to check support for certain
>>>>>>>> platform-specific intrinsics, and adds support for checking support for
>>>>>>>> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>>>>>>>>
>>>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>>>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>>>>>>>> Acked-by: David Christensen <drc@linux.vnet.ibm.com>
>>>>>>>> Acked-by: Jerin Jacob <jerinj@marvell.com>
>>>>>>>> Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>>>>> Acked-by: Ray Kinsella <mdr@ashroe.eu>
>>>>>>>
>>>>>>> Coming late to the party, it seems crowded...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
>>>>>>>> index 872f0ebe3e..28a5aecde8 100644
>>>>>>>> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
>>>>>>>> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
>>>>>>>> @@ -13,6 +13,32 @@
>>>>>>>>      #include "rte_common.h"
>>>>>>>>      #include <errno.h>
>>>>>>>>
>>>>>>>> +#include <rte_compat.h>
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * Structure used to describe platform-specific intrinsics that may or may not
>>>>>>>> + * be supported at runtime.
>>>>>>>> + */
>>>>>>>> +struct rte_cpu_intrinsics {
>>>>>>>> +       uint32_t power_monitor : 1;
>>>>>>>> +       /**< indicates support for rte_power_monitor function */
>>>>>>>> +       uint32_t power_pause : 1;
>>>>>>>> +       /**< indicates support for rte_power_pause function */
>>>>>>>> +};
>>>>>>>
>>>>>>> - The rte_power library is supposed to be built on top of cpuflags.
>>>>>>> Not the other way around.
>>>>>>> Those capabilities should have been kept inside the rte_power_ API and
>>>>>>> not pollute the cpuflags API.
>>>>>>>
>>>>>>> - All of this should have come as a single patch as the previously
>>>>>>> introduced API is unusable before.
>>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * @warning
>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>> + *
>>>>>>>> + * Check CPU support for various intrinsics at runtime.
>>>>>>>> + *
>>>>>>>> + * @param intrinsics
>>>>>>>> + *     Pointer to a structure to be filled.
>>>>>>>> + */
>>>>>>>> +__rte_experimental
>>>>>>>> +void
>>>>>>>> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
>>>>>>>> +
>>>>>>>>      /**
>>>>>>>>       * Enumeration of all CPU features supported
>>>>>>>>       */
>>>>>>>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>>>>>> index fb897d9060..03a326f076 100644
>>>>>>>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>>>>>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>>>>>>> @@ -32,6 +32,10 @@
>>>>>>>>       * checked against the expected value, and if they match, the entering of
>>>>>>>>       * optimized power state may be aborted.
>>>>>>>>       *
>>>>>>>> + * @warning It is responsibility of the user to check if this function is
>>>>>>>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
>>>>>>>> + *   so may result in an illegal CPU instruction error.
>>>>>>>> + *
>>>>>>>
>>>>>>> - Reading this API description... what am I supposed to do in my
>>>>>>> application or driver who wants to use the new
>>>>>>> rte_power_monitor/rte_power_pause stuff?
>>>>>>>
>>>>>>> I should call rte_cpu_get_features(TOTO) ?
>>>>>>> This comment does not give a hint.
>>>>>>>
>>>>>>> I suppose the intent was to refer to the rte_cpu_get_intrinsics_support() thing.
>>>>>>> This must be fixed.
>>>>>>>
>>>>>>>
>>>>>>> - Again, I wonder why we are exposing all this stuff.
>>>>>>> This should be hidden in the rte_power API.
>>>>>>>
>>>>>>
>>>>>> We're exposing all of this here because the intrinsics are *not* part of
>>>>>> the power API but rather are generic headers within EAL. Therefore, any
>>>>>> infrastructure checking for their support can *not* reside in the power
>>>>>> library, but instead has to be in EAL.
>>>>>>
>>>>>> The intended usage here is to call this function before calling
>>>>>> rte_power_monitor(), such that:
>>>>>>
>>>>>> 	struct rte_cpu_intrinsics intrinsics;
>>>>>>
>>>>>> 	rte_cpu_get_intrinsics_support(&intrinsics);
>>>>>>
>>>>>> 	if (!intrinsics.power_monitor) {
>>>>>> 		// rte_power_monitor not supported and cannot be used
>>>>>> 		return;
>>>>>> 	}
>>>>>
>>>>> This check could be done inside the rte_power API.
>>>>
>>>> I'm not quite clear on exactly what you're asking here.
>>>>
>>>> Do you mean the example code above? If so, code like that is already
>>>> present in the power library, at the callback enable stage.
>>>>
>>>> If you mean to say, i should put this check into the rte_power_monitor
>>>> intrinsic, then no, i don't think it's a good idea to have this
>>>> expensive check every time you call rte_power_monitor.
>>>
>>> No but it can be done at initialization time.
>>> According to what you say above, it is already done at callback enable stage.
>>> So the app does not need to do the check?
>>
>> Admittedly it's a bit confusing, but please bear with me.
>>
>> There are two separate issues at hand: the intrinsic itself, and the
>> calling code. We provide both.
>>
>> That means, the *calling code* should do the check. In our case, *our*
>> calling code is the callback. However, nothing stops someone else from
>> implementing their own scheme using our intrinsic - in that case, the
>> user will be responsible to check if the intrinsic is supported before
>> using it in their own code, because they won't be using our callback but
>> will be using our intrinsic.
>>
>> So, we have a check *in our calling code*. But if someone were to use
>> the *intrinsic* directly (like DLB), they would have to add their own
>> checks around the intrinsic usage.
>>
>> Our power intrinsic is a static inline function. Are you proposing to
>> add some sort of function pointer wrapper and make it an indirect call
>> instead of a static inline function? (or indeed a proper function)
>>
>>>
>>>> If you mean put this entire infrastructure into the power API - well,
>>>> that kinda defeats the purpose of both having these intrinsics in
>>>> generic headers and having a generic CPU feature check infrastructure
>>>> that was requested of us during the review. We of course can move the
>>>> intrinsic to the power library and outside of EAL, but then anything
>>>> that requires UMWAIT will have to depend on the librte_power.
>>>
>>> Yes the intrinsics can be in EAL if usable.
>>> But it seems DLB author cannot use what is in EAL.
>>
>> I'll let the DLB authors clarify that themselves, but as far as i'm
>> aware, it seems that this is not the case - while their current code
>> wouldn't be able to use these intrinsics by search-and-replace, they
>> will be able to use them with a couple of changes to their code that
>> basically amounted to reimplementation of our intrinsics.
>>
>>>
>>>> Please clarify exactly what changes you would like to see here, and what
>>>> is your objection.
>>>>
>>>>>
>>>>>> 	// proceed with rte_power_monitor usage
>>>>>>
>>>>>> Failing to do that will result in either -ENOTSUP on non-x86, or illegal
>>>>>> instruction crash on x86 that doesn't have that instruction (because we
>>>>>> encode raw opcode).
>>>>>>
>>>>>> I've *not* added this to the previous patches because i wanted to get
>>>>>> this part reviewed specifically, and not mix it with other IA-specific
>>>>>> stuff. It seems that i've succeeded in that goal, as this patch has 4
>>>>>> likes^W acks :)
>>>>>
>>>>> You did not explain the need for rte_cpu_get_features() call.
>>>>>
>>>>
>>>> Did not explain *where*? Are you suggesting i put things about
>>>> rte_power_monitor into documentation for rte_cpu_get_intrinsics_support?
>>>> The documentation for rte_power_monitor already states that one should
>>>> use rte_cpu_get_intrinsics_support API to check if the rte_power_monitor
>>>> is supported on current machine. What else is missing?
>>>
>>> In your example above, you do not call rte_cpu_get_features()
>>> which is documented as required in the EAL doc.
>>>
>>
>> I'm not sure i follow. This is unrelated to rte_cpu_get_features call.
>> The rte_cpu_get_features is a CPUID check, and it was decided not to use
>> it because the WAITPKG CPUID flag is only defined for x86 and not for
>> other archs. This new call (rte_cpu_get_intrinsics_support) is non-arch
>> specific, but will have an arch-specific implementation (which happens
>> to use rte_cpu_get_features to detect support for WAITPKG). I have given
>> the example code of how to detect support for rte_power_monitor using
>> this new code, in the code example you just referred to.
> 
> Please read the API again:
> http://git.dpdk.org/dpdk/tree/lib/librte_eal/include/generic/rte_power_intrinsics.h
> "
>   * @warning It is responsibility of the user to check if this function is
>   *   supported at runtime using `rte_cpu_get_features()` API call.
>   *   Failing to do so may result in an illegal CPU instruction error.
> "
> Why is it referring to rte_cpu_get_features?
> 

Aw, crap. You're right, it's an artifact of earlier implementation. 
We'll fix this. Thanks for letting us know!

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure
  2020-10-30 16:36                                 ` Burakov, Anatoly
@ 2020-10-30 16:50                                   ` Thomas Monjalon
  0 siblings, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-30 16:50 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: David Marchand, Liang Ma, dev,
	Ruifeng Wang (Arm Technology China),
	Wang, Haiyue, Bruce Richardson, Ananyev, Konstantin, David Hunt,
	Jerin Jacob, Neil Horman, Timothy McDaniel, Gage Eads,
	Marcin Wojtas, Guy Tzalik, Ajit Khaparde, Harman Kalra,
	John Daley, Wei Hu (Xavier),
	Ziyang Xuan, Matan Azrad, Yong Wang, Jerin Jacob, Jan Viktorin,
	David Christensen, Ray Kinsella, john.mcnamara

30/10/2020 17:36, Burakov, Anatoly:
> On 30-Oct-20 3:44 PM, Thomas Monjalon wrote:
> > 30/10/2020 16:27, Burakov, Anatoly:
> >> On 30-Oct-20 2:09 PM, Thomas Monjalon wrote:
> >>> 30/10/2020 14:37, Burakov, Anatoly:
> >>>> On 30-Oct-20 10:14 AM, Thomas Monjalon wrote:
> >>>>> 30/10/2020 11:09, Burakov, Anatoly:
> >>>>>> The intended usage here is to call this function before calling
> >>>>>> rte_power_monitor(), such that:
> >>>>>>
> >>>>>> 	struct rte_cpu_intrinsics intrinsics;
> >>>>>>
> >>>>>> 	rte_cpu_get_intrinsics_support(&intrinsics);
> >>>>>>
> >>>>>> 	if (!intrinsics.power_monitor) {
> >>>>>> 		// rte_power_monitor not supported and cannot be used
> >>>>>> 		return;
> >>>>>> 	}
> >>>>>
[...]
> >>> In your example above, you do not call rte_cpu_get_features()
> >>> which is documented as required in the EAL doc.
> >>>
> >>
> >> I'm not sure i follow. This is unrelated to rte_cpu_get_features call.
> >> The rte_cpu_get_features is a CPUID check, and it was decided not to use
> >> it because the WAITPKG CPUID flag is only defined for x86 and not for
> >> other archs. This new call (rte_cpu_get_intrinsics_support) is non-arch
> >> specific, but will have an arch-specific implementation (which happens
> >> to use rte_cpu_get_features to detect support for WAITPKG). I have given
> >> the example code of how to detect support for rte_power_monitor using
> >> this new code, in the code example you just referred to.
> > 
> > Please read the API again:
> > http://git.dpdk.org/dpdk/tree/lib/librte_eal/include/generic/rte_power_intrinsics.h
> > "
> >   * @warning It is responsibility of the user to check if this function is
> >   *   supported at runtime using `rte_cpu_get_features()` API call.
> >   *   Failing to do so may result in an illegal CPU instruction error.
> > "
> > Why is it referring to rte_cpu_get_features?
> 
> Aw, crap. You're right, it's an artifact of earlier implementation. 
> We'll fix this. Thanks for letting us know!

This is what happens when everybody pushes me to merge a patch
that I believe is not ready, but which has a lot of "acked but not reviewed".

The context around this patch series does not allow for good quality.
That's why I think we should not merge any more patches on top of it
in this release, except DLB PMDs and fixes.




* [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
  2020-05-28 11:39   ` Ananyev, Konstantin
@ 2020-11-02 11:09   ` Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma
                       ` (17 more replies)
  1 sibling, 18 replies; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:09 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang

This patchset proposes a simple API for Ethernet drivers
to cause the CPU to enter a power-optimized state while
waiting for packets to arrive, along with a set of
generic intrinsics that facilitate that. This is achieved
through cooperation with the NIC driver, which lets us
learn the address of the wake-up event and wait for writes
to it.

On IA, this is achieved through using UMONITOR/UMWAIT
instructions. They are used in their raw opcode form
because there is no widespread compiler support for
them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen
to implement similar instructions.

To achieve power savings, there is a very simple mechanism
used: we're counting empty polls, and if a certain threshold
is reached, we get the address of next RX ring descriptor
from the NIC driver, arm the monitoring hardware, and
enter a power-optimized state. We will then wake up when
either a timeout happens, or a write happens (or generally
whenever CPU feels like waking up - this is platform-
specific), and proceed as normal. The empty poll counter is
reset whenever we actually get packets, so we only go to
sleep when we know nothing is going on. The mechanism is
generic and can be used for any write-back descriptor.

Why are we putting it into ethdev as opposed to leaving
this up to the application? Our customers specifically
requested a way to do it with minimal changes to the
application code. The current approach allows one to just
flip a switch and automatically have power savings.

- Only 1:1 core to queue mapping is supported,
  meaning that each lcore can handle RX on at most a
  single queue
- Three policies are supported: UMWAIT, PAUSE and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  51 +++
 doc/guides/rel_notes/release_20_11.rst        |  17 +
 .../sample_app_ug/l3_forward_power_man.rst    |  14 +
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  26 ++
 drivers/net/i40e/i40e_rxtx.h                  |   2 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
 examples/l3fwd-power/main.c                   |  46 ++-
 lib/librte_ethdev/rte_ethdev.c                |  23 ++
 lib/librte_ethdev/rte_ethdev.h                |  41 +++
 lib/librte_ethdev/rte_ethdev_driver.h         |  28 ++
 lib/librte_ethdev/version.map                 |   1 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 320 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  92 +++++
 lib/librte_power/version.map                  |   4 +
 21 files changed, 725 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.17.1



* [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
@ 2020-11-02 11:10     ` Liang Ma
  2020-11-02 12:23       ` Burakov, Anatoly
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 2/6] power: add PMD power management API and callback Liang Ma
                       ` (16 subsequent siblings)
  17 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Anatoly Burakov, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella

Add a simple API for getting wake-up notification information (wake
address, expected value, mask and data size) from the PMD, and add the
corresponding release notes entry.
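
The wake-up condition documented for rte_eth_get_wake_addr() in the diff
below can be sketched in standalone C as follows. This is only an
illustration under assumptions: read_mem here is a hypothetical helper
(the Doxygen comment uses the name without defining it),
wake_condition_met is not part of the patch, and treating a zero mask as
"no value-based wake condition" is one reading of the note in the API
documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: read data_sz bytes (1, 2, 4 or 8), zero-extended. */
static uint64_t
read_mem(const volatile void *addr, uint8_t data_sz)
{
	switch (data_sz) {
	case 1: return *(const volatile uint8_t *)addr;
	case 2: return *(const volatile uint16_t *)addr;
	case 4: return *(const volatile uint32_t *)addr;
	case 8: return *(const volatile uint64_t *)addr;
	default:
		assert(0 && "data_sz must be 1, 2, 4 or 8");
		return 0;
	}
}

/* Evaluate (read_mem(wake_addr, data_sz) & mask) == expected. */
static bool
wake_condition_met(const volatile void *wake_addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	/* assumption: a zero mask means the driver gave no wakeup value */
	if (mask == 0)
		return false;
	return (read_mem(wake_addr, data_sz) & mask) == expected;
}
```

A caller would typically evaluate this before arming the monitor, so
that a descriptor already written by the NIC does not put the core to
sleep.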

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v11:
    - Rework the API Doxygen documentation

    v10:
    - Address minor issue on comments and release notes

    v8:
    - Rename version map file name

    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup

    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check
---
 doc/guides/rel_notes/release_20_11.rst |  4 +++
 lib/librte_ethdev/rte_ethdev.c         | 23 +++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 41 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 28 ++++++++++++++++++
 lib/librte_ethdev/version.map          |  1 +
 5 files changed, 97 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 88b9086390..e95e6aa7a5 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -148,6 +148,10 @@ New Features
   Hairpin Tx part flow rules can be inserted explicitly.
   New API is added to get the hairpin peer ports list.
 
+* **ethdev: added 1 new API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()`` is added to get the wake up address from device.
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index b12bb3854d..4f3115fe8e 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5138,6 +5138,29 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask, data_sz));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index e341a08817..d208fe99ca 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4364,6 +4364,47 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * In order to make use of some PMD Power Management schemes, the user might
+ * want to wait (sleep/poll) until new packets arrive. This function
+ * retrieves the necessary information from the PMD to enter that wait/sleep
+ * state. The main parameter provided is the address to monitor while waiting
+ * to wake up. In addition to this wake address, the function also provides
+ * extra information including expected value, selection mask and data size to
+ * monitor. The user is expected to use this information to enter low power
+ * mode using the rte_power_monitor API, and the core will exit low power mode
+ * upon reaching the expected condition:
+ * ((((uint64_t)read_mem(wake_addr, data_sz)) & mask) == expected).
+ * @note The low power mode can also exit in other cases, e.g. interrupt.
+ *
+ * @param[in] port_id
+ *  The port identifier of the Ethernet device.
+ * @param[in] queue_id
+ *  The Rx queue on the Ethernet device for which information will be
+ *  retrieved.
+ * @param[out] wake_addr
+ *  The pointer to the address which will be monitored.
+ * @param[out] expected
+ *  The pointer to the expected value to allow wakeup condition.
+ * @param[out] mask
+ *  The pointer to comparison bitmask for the expected value.
+ * @note a mask value of zero should be treated as:
+ *  "no special wakeup values for provided address from the driver".
+ * @param[out] data_sz
+ *  The pointer to data size for the expected value (in bytes)
+ * @note valid values are 1,2,4,8
+ *
+ * @return
+ *  - 0: Success.
+ *  -ENOTSUP: Operation not supported.
+ *  -EINVAL: Invalid parameters.
+ *  -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c63b9f7eb7..e7ce1e261d 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -752,6 +752,32 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to where the wake address will be stored.
+ * @param expected
+ *   Pointer to the value to expect once the descriptor is set.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @param data_sz
+ *   Data size for the expected value (can be 1, 2, 4, or 8 bytes)
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -910,6 +936,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get next RX queue ring entry address. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index 8ddda2547f..f9ce4e3c8d 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -235,6 +235,7 @@ EXPERIMENTAL {
 	rte_eth_fec_get_capability;
 	rte_eth_fec_get;
 	rte_eth_fec_set;
+	rte_eth_get_wake_addr;
 	rte_flow_shared_action_create;
 	rte_flow_shared_action_destroy;
 	rte_flow_shared_action_query;
-- 
2.17.1



* [dpdk-dev] [PATCH v11 2/6] power: add PMD power management API and callback
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma
@ 2020-11-02 11:10     ` Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 3/6] net/ixgbe: implement power management API Liang Ma
                       ` (15 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Anatoly Burakov, Ray Kinsella

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either
a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

The following are the available schemes:

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting for the address of the next
   RX descriptor to be written to.

2. Pause instruction:

   Instead of moving the core into a deeper C-state, this method uses the
   pause instruction to avoid busy polling.

3. Frequency scaling:

   Reuse the existing DPDK power library to scale the core frequency
   up/down depending on traffic volume.
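
To make the difference between the three schemes concrete, here is a
small standalone sketch of what each one does once a poll result is
known. It is an illustration only: the enum names and scheme_action are
hypothetical and do not appear in the patch, which implements the same
idea as three separate RX callbacks (clb_umwait, clb_pause and
clb_scale_freq).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, used only for this sketch. */
enum scheme { SCHEME_WAIT, SCHEME_PAUSE, SCHEME_SCALE };
enum action {
	ACT_NONE,          /* keep polling as usual */
	ACT_MONITOR_SLEEP, /* arm monitoring on the wake address and sleep */
	ACT_SHORT_DELAY,   /* brief pause/delay instead of a busy poll */
	ACT_FREQ_MIN,      /* scale core frequency down */
	ACT_FREQ_MAX,      /* scale core frequency back up */
};

/*
 * Pick the action for one poll. idle_threshold_reached tells whether the
 * empty-poll counter has exceeded its threshold.
 */
static enum action
scheme_action(enum scheme s, uint16_t nb_rx, bool idle_threshold_reached)
{
	if (nb_rx != 0)
		/* only frequency scaling has work to undo when traffic returns */
		return s == SCHEME_SCALE ? ACT_FREQ_MAX : ACT_NONE;

	if (!idle_threshold_reached)
		return ACT_NONE;

	switch (s) {
	case SCHEME_WAIT:
		return ACT_MONITOR_SLEEP;
	case SCHEME_PAUSE:
		return ACT_SHORT_DELAY;
	case SCHEME_SCALE:
		return ACT_FREQ_MIN;
	}
	assert(0 && "unknown scheme");
	return ACT_NONE;
}
```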

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v11:
    - Updated power library document
    - Updated release notes

    v10:
    - Updated power library document

    v8:
    - Rename version map file name

    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup

    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support
        get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the
      intrinsic check
---
 doc/guides/prog_guide/power_man.rst    |  51 ++++
 doc/guides/rel_notes/release_20_11.rst |  15 +-
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 320 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 +++++++
 lib/librte_power/version.map           |   4 +
 6 files changed, 484 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..380e7aace7 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,54 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+
+Existing Power Management mechanisms require developers to change the design of
+an application or change code to make use of them. The PMD Power Management API
+provides a convenient alternative by using Ethernet PMD RX callbacks, and
+triggering power saving whenever the empty poll count reaches a certain number.
+
+There are multiple power saving schemes available for the developer to choose from.
+Although the developer can configure each queue with a different scheme, it is
+strongly recommended to configure all queues within the same port with the same
+scheme.
+
+The following are the available schemes:
+
+  * UMWAIT/UMONITOR
+
+   This power saving scheme will put the core into an optimized power state and
+   use the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will use the "rte_pause" function to reduce the impact
+   of busy polling.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing power library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+   Currently, this Power Management API is limited to a mapping of 1 queue to 1
+   core (multiple queues are supported, but they must be polled from different
+   cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Queue Enable**: Enable a specific power scheme for a certain queue/port/core
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core
+
 References
 ----------
 
@@ -200,3 +248,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index e95e6aa7a5..0430fca9cc 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -150,7 +150,9 @@ New Features
 
 * **ethdev: added 1 new API for PMD power management.**
 
-  * ``rte_eth_get_wake_addr()`` is added to get the wake up address from device.
+  * ``rte_eth_get_wake_addr()`` function has been added to allow applications to
+    fetch the wake up information from the device. The processor needs that
+    information to wake up from the low power state.
 
 * **Updated Broadcom bnxt driver.**
 
@@ -362,6 +364,17 @@ New Features
   * Replaced ``--scalar`` command-line option with ``--alg=<value>``, to allow
     the user to select the desired classify method.
 
+* **Added PMD power management mechanism**
+
+  The new Ethernet PMD Power Management mechanisms have been added through the
+  existing RX callback infrastructure.
+
+  * Added power saving scheme based on UMWAIT instruction (x86 only)
+  * Added power saving scheme based on ``rte_pause()``
+  * Added power saving scheme based on frequency scaling through the power library
+  * Added new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Added new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..0dcaddc3bd
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,320 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static inline void
+umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static inline void
+umwait_sleep(struct pmd_queue_cfg *q_conf, uint16_t port_id, uint16_t qidx)
+{
+	volatile void *target_addr;
+	uint64_t expected, mask;
+	uint8_t data_sz;
+	int ret;
+
+	/*
+	 * get wake up address for this RX queue, as well as expected value,
+	 * comparison mask, and data size.
+	 */
+	ret = rte_eth_get_wake_addr(port_id, qidx, &target_addr,
+			&expected, &mask, &data_sz);
+
+	/* this should always succeed as all checks have been done already */
+	if (unlikely(ret != 0))
+		return;
+
+	/*
+	 * take out a spinlock to prevent control plane from concurrently
+	 * modifying the wakeup data.
+	 */
+	rte_spinlock_lock(&q_conf->umwait_lock);
+
+	/* have we been disabled by control plane? */
+	if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		/* we're good to go */
+
+		/*
+		 * store the wakeup address so that control plane can trigger a
+		 * write to this address and wake us up.
+		 */
+		q_conf->wait_addr = target_addr;
+		/* -1ULL is maximum value for TSC */
+		rte_power_monitor_sync(target_addr, expected, mask, -1ULL,
+				data_sz, &q_conf->umwait_lock);
+		/* erase the address */
+		q_conf->wait_addr = NULL;
+	}
+	rte_spinlock_unlock(&q_conf->umwait_lock);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			umwait_sleep(q_conf, port_id, qidx);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+		/*
+		 * we need to disable early as there might be callback currently
+		 * spinning on a lock.
+		 */
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v11 3/6] net/ixgbe: implement power management API
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 2/6] power: add PMD power management API and callback Liang Ma
@ 2020-11-02 11:10     ` Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 4/6] net/i40e: " Liang Ma
                       ` (14 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Anatoly Burakov, Jeff Guo

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 00101c2eec..fcc4026372 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 6cfbb582e2..db94b9d05d 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	*data_sz = 4;
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..1ef0b05e66 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v11 4/6] net/i40e: implement power management API
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (2 preceding siblings ...)
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 3/6] net/ixgbe: implement power management API Liang Ma
@ 2020-11-02 11:10     ` Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 5/6] net/ice: " Liang Ma
                       ` (13 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Anatoly Burakov, Beilei Xing, Jeff Guo

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 4778aaf299..358a38232b 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr	              = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..78862fe3a2 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,32 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	*data_sz = 8;
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..5826cf1099 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v11 5/6] net/ice: implement power management API
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (3 preceding siblings ...)
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 4/6] net/i40e: " Liang Ma
@ 2020-11-02 11:10     ` Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 6/6] examples/l3fwd-power: enable PMD power mgmt Liang Ma
                       ` (12 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Anatoly Burakov, Qiming Yang, Qi Zhang

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index c65125ff32..54f185ad4d 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr	              = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index ee576c362a..fafd6ada62 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	*data_sz = 2;
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..7eeb8d467e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v11 6/6] examples/l3fwd-power: enable PMD power mgmt
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (4 preceding siblings ...)
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 5/6] net/ice: " Liang Ma
@ 2020-11-02 11:10     ` Liang Ma
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 00/11] Add PMD power management Anatoly Burakov
                       ` (11 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Anatoly Burakov

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v11:
    - Update l3fwd-power documentation

    v8:
    - Add return status check for queue enable

    v6:
    - Fixed typos in documentation
---
 .../sample_app_ug/l3_forward_power_man.rst    | 14 ++++++
 examples/l3fwd-power/main.c                   | 46 ++++++++++++++++++-
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index d7e1dc5813..149342d112 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -455,3 +457,15 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode in ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` performs simple L3 forwarding and enables the power saving scheme on a specific
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power
+management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index a48d75f68f..aafa415f0b 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,17 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(lcore_id,
+						     portid, queueid,
+						     RTE_POWER_MGMT_TYPE_SCALE);
+
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+					"rte_power_pmd_mgmt enable: err=%d, "
+					"port=%d\n", ret, portid);
+
+			}
 		}
 	}
 
@@ -2790,6 +2817,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2816,6 +2846,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma
@ 2020-11-02 12:23       ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-11-02 12:23 UTC (permalink / raw)
  To: Liang Ma, dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella

On 02-Nov-20 11:10 AM, Liang Ma wrote:
> Add a simple API to allow getting address of getting notification
> information from the PMD, as well as release notes information.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
> 
> Notes:
>      v11:
>      - Rework the API Doxygen documentation
> 
>      v10:
>      - Address minor issue on comments and release notes
> 
>      v8:
>      - Rename version map file name
> 
>      v7:
>      - Fixed race condition (Konstantin)
>      - Slight rework of the structure of monitor code
>      - Added missing inline for wakeup
> 
>      v6:
>      - Added wakeup mechanism for UMWAIT
>      - Removed memory allocation (everything is now allocated statically)
>      - Fixed various typos and comments
>      - Check for invalid queue ID
>      - Moved release notes to this patch
> 
>      v5:
>      - Make error checking more robust
>        - Prevent initializing scaling if ACPI or PSTATE env wasn't set
>        - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
>      - Add some debug logging
>      - Replace x86-specific code path to generic path using the intrinsic check
> ---

<snip>

>   int
>   rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>   			     struct rte_ether_addr *mc_addr_set,
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index e341a08817..d208fe99ca 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -4364,6 +4364,47 @@ __rte_experimental
>   int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>   	struct rte_eth_burst_mode *mode);
>   
> +/**
> + * In order to make use of some PMD Power Management schemes, the user might
> + * want to wait (sleep/poll) until new packets arrive. This function
> + * retrieves the necessary information from the PMD to enter that wait/sleep
> + * state. The main parameter provided is the address to monitor while waiting
> + * to wake up. In addition to this wake address, the function also provides
> + * extra information including expected value, selection mask and data size to
> + * monitor. The user is expected to use this information to enter low power
> + * mode using the rte_power_monitor API, and the core will exit low power mode
> + * upon reaching the expected condition:
> + * (((uint64_t)read_mem(wake_addr, data_sz)) & mask) == expected).
> + * @note The low power mode can also exit in other cases, e.g. interrupt.

Could we maybe have some paragraphs and spacing here, instead of one 
solid block of text? Suggested formatting:

  * In order to make use of some PMD Power Management schemes, the user might
  * want to wait (sleep/poll) until new packets arrive. This function retrieves
  * the necessary information from the PMD to enter that wait/sleep state.
  *
  * The main parameter provided is the address to monitor while waiting to wake
  * up. In addition to this wake address, the function also provides extra
  * information including expected value, selection mask and data size to
  * monitor.
  *
  * The user is expected to use this information to enter low power mode using
  * the rte_power_monitor API, and the core will exit low power mode upon
  * reaching the expected condition:
  *
  *   `(((uint64_t)read_mem(wake_addr, data_sz)) & mask) == expected`
  *
  * @note The low power mode can also exit in other cases, e.g. interrupt.
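
For illustration only, the wake condition above could be sketched in plain
C roughly as follows. The helper names read_mem() and wake_condition_met()
are hypothetical and not part of the patch, the reads assume a naturally
aligned address, and treating a zero mask as "condition never already met"
is my assumption rather than something the API defines.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: read data_sz bytes (1, 2, 4 or 8) from addr and
 * widen the result to 64 bits. Returns false for any other size.
 * Assumes addr is naturally aligned for the requested size.
 */
bool
read_mem(const volatile void *addr, uint8_t data_sz, uint64_t *out)
{
	switch (data_sz) {
	case 1:
		*out = *(const volatile uint8_t *)addr;
		return true;
	case 2:
		*out = *(const volatile uint16_t *)addr;
		return true;
	case 4:
		*out = *(const volatile uint32_t *)addr;
		return true;
	case 8:
		*out = *(const volatile uint64_t *)addr;
		return true;
	default:
		return false;
	}
}

/*
 * Returns true if the monitored location already satisfies
 * (value & mask) == expected, i.e. there is no point in sleeping.
 */
bool
wake_condition_met(const volatile void *addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	uint64_t cur;

	if (!read_mem(addr, data_sz, &cur))
		return false;
	/* assumption: a zero mask means the driver gave no wakeup value */
	if (mask == 0)
		return false;
	return (cur & mask) == expected;
}
```

A caller would run this check right before going to sleep, so that a
descriptor written between the last poll and the sleep is not missed.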

> + *
> + * @param[in] port_id
> + *  The port identifier of the Ethernet device.
> + * @param[in] queue_id
> + *  The Rx queue on the Ethernet device for which information will be
> + *  retrieved.
> + * @param[out] wake_addr
> + *  The pointer to the address which will be monitored.
> + * @param[out] expected
> + *  The pointer to the expected value to allow wakeup condition.
> + * @param[out] mask
> + *  The pointer to comparison bitmask for the expected value.
> + * @note a mask value of zero should be treated as:
> + *  “no special wakeup values for provided address from the driver”.

Wrong quotes.

> + * @param[out] data_sz
> + *  The pointer to data size for the expected value (in bytes)
> + * @note valid values are 1,2,4,8

Shouldn't @note be under @param, i.e. indented to the right at the same 
level the text is? Also, everywhere else in this file, the indentation 
of the text is two spaces, not one. So, should be e.g.

* @param[out] data_sz
*   The pointer to data......
*   @note valid values are.....

> + *
> + * @return
> + *  - 0: Success.
> + *  -ENOTSUP: Operation not supported.
> + *  -EINVAL: Invalid parameters.
> + *  -ENODEV: Invalid port ID.
> + */
> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
> +		uint8_t *data_sz);
> +
>   /**
>    * Retrieve device registers and register attributes (number of registers and
>    * register size)
> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
> index c63b9f7eb7..e7ce1e261d 100644
> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> @@ -752,6 +752,32 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
>   	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
>   /**< @internal Unbind peer queue from the current queue. */
>   
> +/**
> + * @internal
> + * Get address of memory location whose contents will change whenever there is
> + * new data to be received.
> + *
> + * @param rxq
> + *   Ethdev queue pointer.
> + * @param tail_desc_addr
> + *   The pointer point to where the address will be stored.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + * @param data_sz
> + *   Data size for the expected value (can be 1, 2, 4, or 8 bytes)
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success
> + * @retval -EINVAL
> + *   Invalid parameters
> + */
> +typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
> +
>   /**
>    * @internal A structure containing the functions exported by an Ethernet driver.
>    */
> @@ -910,6 +936,8 @@ struct eth_dev_ops {
>   	/**< Set up the connection between the pair of hairpin queues. */
>   	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
>   	/**< Disconnect the hairpin queues of a pair from each other. */
> +	eth_get_wake_addr_t get_wake_addr;
> +	/**< Get next RX queue ring entry address. */
>   };
>   
>   /**
> diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
> index 8ddda2547f..f9ce4e3c8d 100644
> --- a/lib/librte_ethdev/version.map
> +++ b/lib/librte_ethdev/version.map
> @@ -235,6 +235,7 @@ EXPERIMENTAL {
>   	rte_eth_fec_get_capability;
>   	rte_eth_fec_get;
>   	rte_eth_fec_set;
> +	rte_eth_get_wake_addr;
>   	rte_flow_shared_action_create;
>   	rte_flow_shared_action_destroy;
>   	rte_flow_shared_action_query;
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v12 00/11] Add PMD power management
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (5 preceding siblings ...)
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 6/6] examples/l3fwd-power: enable PMD power mgmt Liang Ma
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 16:12       ` David Marchand
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 01/11] eal: uninline power intrinsics Anatoly Burakov
                       ` (10 subsequent siblings)
  17 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. This is achieved through cooperation with the NIC driver, which
tells us the address of the wake-up event so that we can wait for
writes to it.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They 
are used in their raw opcode form because there is no widespread 
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement 
similar instructions.

To achieve power savings, a very simple mechanism is used: we count
empty polls, and if a certain threshold is reached, we get the address
of the next RX ring descriptor from the NIC driver, arm the monitoring
hardware, and enter a power-optimized state. We then wake up when
either a timeout happens or a write happens (or generally whenever the
CPU feels like waking up - this is platform-specific), and proceed as
normal. The empty poll counter is reset whenever we actually get
packets, so we only go to sleep when we know nothing is going on. The
mechanism is generic and can be used for any write-back descriptor.
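
As a rough sketch of the empty-poll bookkeeping described above (not the
code from this patchset): the threshold value EMPTYPOLL_MAX, the struct
queue_state layout and the should_sleep() helper are all hypothetical
names chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical threshold; the value used by the library may differ */
#define EMPTYPOLL_MAX 512

struct queue_state {
	uint64_t empty_polls; /* consecutive polls that returned nothing */
	uint64_t sleeps;      /* number of times we decided to sleep */
};

/*
 * Call after every RX burst with the number of packets received.
 * Returns true when the caller should fetch the wake address, arm the
 * monitor and enter the power-optimized state.
 */
bool
should_sleep(struct queue_state *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic is flowing: reset the counter and keep polling */
		q->empty_polls = 0;
		return false;
	}
	q->empty_polls++;
	if (q->empty_polls <= EMPTYPOLL_MAX)
		return false;
	q->sleeps++;
	return true;
}
```

Once the threshold is crossed, every further empty poll asks for a sleep
again, and the first non-empty burst returns the queue to normal polling.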

This patchset also introduces a few changes to the existing power
management-related intrinsics, namely to provide a native way of waking
up a sleeping core without the application being responsible for it, as
well as general robustness improvements. There's quite a bit of locking
going on, but these locks are per-thread and very little (if any)
contention is expected, so the performance impact shouldn't be that bad
(and in any case the locking happens when we're about to sleep anyway,
not on a hotpath).

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows the
user to just flip a switch and automatically have power savings.

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle RX on at most a single queue
- Three policies are supported: monitor, pause, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  44 +++
 doc/guides/rel_notes/release_21_02.rst        |  14 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |  10 +-
 drivers/event/dlb2/dlb2.c                     |  10 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  25 ++
 drivers/net/i40e/i40e_rxtx.h                  |   1 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   1 +
 examples/l3fwd-power/main.c                   |  89 ++++-
 .../arm/include/rte_power_intrinsics.h        |  39 +-
 .../include/generic/rte_power_intrinsics.h    |  78 ++--
 .../ppc/include/rte_power_intrinsics.h        |  39 +-
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 ------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 169 +++++++++
 lib/librte_ethdev/rte_ethdev.c                |  28 ++
 lib/librte_ethdev/rte_ethdev.h                |  25 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  22 ++
 lib/librte_ethdev/version.map                 |   3 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 349 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  90 +++++
 lib/librte_power/version.map                  |   5 +
 30 files changed, 1028 insertions(+), 229 deletions(-)
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.17.1


* [dpdk-dev] [PATCH v12 01/11] eal: uninline power intrinsics
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (6 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 00/11] Add PMD power management Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 02/11] eal: avoid invalid API usage in " Anatoly Burakov
                       ` (9 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, power intrinsics are inline functions. Make them part of the
ABI so that we can have various internal data associated with them
without exposing said data to the outside world.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../arm/include/rte_power_intrinsics.h        |   6 +-
 .../include/generic/rte_power_intrinsics.h    |   6 +-
 .../ppc/include/rte_power_intrinsics.h        |   6 +-
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 -----------------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 120 ++++++++++++++++++
 7 files changed, 135 insertions(+), 124 deletions(-)
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index a4a1bc1159..5e384d380e 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -16,7 +16,7 @@ extern "C" {
 /**
  * This function is not supported on ARM.
  */
-static inline void
+void
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
@@ -31,7 +31,7 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 /**
  * This function is not supported on ARM.
  */
-static inline void
+void
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
@@ -47,7 +47,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 /**
  * This function is not supported on ARM.
  */
-static inline void
+void
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index dd520d90fa..67977bd511 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -52,7 +52,7 @@
  *   to undefined result.
  */
 __rte_experimental
-static inline void rte_power_monitor(const volatile void *p,
+void rte_power_monitor(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz);
 
@@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p,
  *   wakes up.
  */
 __rte_experimental
-static inline void rte_power_monitor_sync(const volatile void *p,
+void rte_power_monitor_sync(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz,
 		rte_spinlock_t *lck);
@@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  *   architecture-dependent.
  */
 __rte_experimental
-static inline void rte_power_pause(const uint64_t tsc_timestamp);
+void rte_power_pause(const uint64_t tsc_timestamp);
 
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 4ed03d521f..4cb5560c02 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -16,7 +16,7 @@ extern "C" {
 /**
  * This function is not supported on PPC64.
  */
-static inline void
+void
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
@@ -31,7 +31,7 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 /**
  * This function is not supported on PPC64.
  */
-static inline void
+void
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
@@ -47,7 +47,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 /**
  * This function is not supported on PPC64.
  */
-static inline void
+void
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 354c068f31..31bf76ae81 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -403,6 +403,11 @@ EXPERIMENTAL {
 	rte_service_lcore_may_be_active;
 	rte_vect_get_max_simd_bitwidth;
 	rte_vect_set_max_simd_bitwidth;
+
+	# added in 21.02
+	rte_power_monitor;
+	rte_power_monitor_sync;
+	rte_power_pause;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
index c7d790c854..e4c2b87f73 100644
--- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -13,121 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-static inline uint64_t
-__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz)
-{
-	switch (sz) {
-	case sizeof(uint8_t):
-		return *(const volatile uint8_t *)p;
-	case sizeof(uint16_t):
-		return *(const volatile uint16_t *)p;
-	case sizeof(uint32_t):
-		return *(const volatile uint32_t *)p;
-	case sizeof(uint64_t):
-		return *(const volatile uint64_t *)p;
-	default:
-		/* this is an intrinsic, so we can't have any error handling */
-		RTE_ASSERT(0);
-		return 0;
-	}
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
-/**
- * This function uses TPAUSE instruction  and will enter C0.2 state. For more
- * information about usage of this instruction, please refer to Intel(R) 64 and
- * IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* execute TPAUSE */
-	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-		: /* ignore rflags */
-		: "D"(0), /* enter C0.2 */
-		  "a"(tsc_l), "d"(tsc_h));
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build
index e78f29002e..dfd42dee0c 100644
--- a/lib/librte_eal/x86/meson.build
+++ b/lib/librte_eal/x86/meson.build
@@ -8,4 +8,5 @@ sources += files(
 	'rte_cycles.c',
 	'rte_hypervisor.c',
 	'rte_spinlock.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
new file mode 100644
index 0000000000..34c5fd9c3e
--- /dev/null
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses TPAUSE instruction  and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			"a"(tsc_l), "d"(tsc_h));
+}
-- 
2.17.1


* [dpdk-dev] [PATCH v12 02/11] eal: avoid invalid API usage in power intrinsics
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (7 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 01/11] eal: uninline power intrinsics Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 03/11] eal: change API of " Anatoly Burakov
                       ` (8 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Konstantin Ananyev, thomas, gage.eads,
	timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the API documentation mandates that if the user wants to use
the power management intrinsics, they need to call the
`rte_cpu_get_intrinsics_support` API and check support for specific
intrinsics.

However, if the user does not do that, it is possible to get an illegal
instruction error because we're using raw instruction opcodes, which may
or may not be supported at runtime.

Now that we have everything in a C file, we can check for support at
startup and prevent the user from possibly encountering illegal
instruction errors.
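
The idea can be sketched in standalone C roughly as follows. This is an
illustration only, not the patch itself: cpu_has_wait_instructions() is
a hypothetical stand-in for rte_cpu_get_intrinsics_support(), the
GCC/Clang constructor attribute stands in for RTE_INIT, and the int
return value is added purely so the effect is observable (the functions
in this patch return void).

```c
#include <stdint.h>

/* Set once at startup; stays 0 when the instructions are unavailable. */
static uint8_t wait_supported;

/* Hypothetical stand-in for rte_cpu_get_intrinsics_support(). */
static int
cpu_has_wait_instructions(void)
{
	return 0; /* assume "not supported" for this sketch */
}

/* Runs before main(), similar in spirit to RTE_INIT (GCC/Clang extension). */
__attribute__((constructor))
static void
power_intrinsics_init(void)
{
	if (cpu_has_wait_instructions())
		wait_supported = 1;
}

/* Returns 0 if it would have paused, -1 if the call was a no-op. */
static int
power_pause_guarded(uint64_t tsc_timestamp)
{
	(void)tsc_timestamp;

	/* bail out before reaching any raw opcode */
	if (!wait_supported)
		return -1;

	/* the raw TPAUSE opcode would be emitted here */
	return 0;
}
```

Because the flag is filled in before main() runs, every later call can
test it cheaply without repeating the CPU feature query.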

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  3 --
 lib/librte_eal/x86/rte_power_intrinsics.c     | 31 +++++++++++++++++--
 2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 67977bd511..ffa72f7578 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -34,7 +34,6 @@
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 34c5fd9c3e..b48a54ec7f 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,8 @@
 
 #include "rte_power_intrinsics.h"
 
+static uint8_t wait_supported;
+
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
 {
@@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp)
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/* execute TPAUSE */
 	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			"a"(tsc_l), "d"(tsc_h));
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+RTE_INIT(rte_power_intrinsics_init) {
+	struct rte_cpu_intrinsics i;
+
+	rte_cpu_get_intrinsics_support(&i);
+
+	if (i.power_monitor && i.power_pause)
+		wait_supported = 1;
 }
-- 
2.17.1


* [dpdk-dev] [PATCH v12 03/11] eal: change API of power intrinsics
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (8 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 02/11] eal: avoid invalid API usage in " Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 04/11] eal: remove sync version of power monitor Anatoly Burakov
                       ` (7 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev, thomas,
	gage.eads, david.hunt, chris.macnamara

Instead of passing around pointers and integers, collect everything
into a struct. This makes API design around these intrinsics much easier.
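
To illustrate how such a condition struct can be consumed, here is a
minimal standalone sketch. The field names follow the struct added in
this patch, but struct monitor_cond, read_sized() and cond_already_met()
are local names invented for the example, and returning 0 for an
unsupported size is an assumption (the patch asserts instead).

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the condition struct; field names taken from the diff. */
struct monitor_cond {
	volatile void *addr; /* address to monitor for changes */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* mask applied to the value read from addr */
	uint8_t data_sz;     /* size of the read in bytes: 1, 2, 4 or 8 */
};

/* Read data_sz bytes from p and widen the result to 64 bits. */
static uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		return 0; /* unsupported size: assumption for this sketch */
	}
}

/* True if the sleep should be skipped: the masked value already matches. */
static bool
cond_already_met(const struct monitor_cond *c)
{
	if (c->mask == 0)
		return false; /* zero mask: the comparison is skipped */
	return (read_sized(c->addr, c->data_sz) & c->mask) == c->val;
}
```

A caller fills in the struct once and passes a single pointer, so adding
a field later does not require touching every call site.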

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/event/dlb/dlb.c                       | 10 ++--
 drivers/event/dlb2/dlb2.c                     | 10 ++--
 .../arm/include/rte_power_intrinsics.h        | 20 +++-----
 .../include/generic/rte_power_intrinsics.h    | 49 ++++++++-----------
 .../ppc/include/rte_power_intrinsics.h        | 20 +++-----
 lib/librte_eal/x86/rte_power_intrinsics.c     | 32 ++++++------
 6 files changed, 62 insertions(+), 79 deletions(-)

diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c
index 0c95c4793d..d2f2026291 100644
--- a/drivers/event/dlb/dlb.c
+++ b/drivers/event/dlb/dlb.c
@@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		/* Interrupts not supported by PF PMD */
 		return 1;
 	} else if (dlb->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 86724863f2..c9a8a02278 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 	if (elapsed_ticks >= timeout) {
 		return 1;
 	} else if (dlb2->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb2_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 5e384d380e..76a5fa5234 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -17,31 +17,23 @@ extern "C" {
  * This function is not supported on ARM.
  */
 void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on ARM.
  */
 void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 }
 
 /**
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index ffa72f7578..00c670cb50 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -18,6 +18,18 @@
  * which are architecture-dependent.
  */
 
+struct rte_power_monitor_cond {
+	volatile void *addr;  /**< Address to monitor for changes */
+	uint64_t val;         /**< Before attempting the monitoring, the address
+	                       *   may be read and compared against this value.
+	                       **/
+	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
+	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
+	                  *   expected value with the memory address. Can be 1,
+	                  *   2, 4, or 8. Supplying any other value will lead to
+	                  *   undefined result. */
+};
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
@@ -35,25 +47,15 @@
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  */
 __rte_experimental
-void rte_power_monitor(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz);
+void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp);
 
 /**
  * @warning
@@ -75,30 +77,19 @@ void rte_power_monitor(const volatile void *p,
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  * @param lck
  *   A spinlock that must be locked before entering the function, will be
  *   unlocked while the CPU is sleeping, and will be locked again once the CPU
  *   wakes up.
  */
 __rte_experimental
-void rte_power_monitor_sync(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz,
-		rte_spinlock_t *lck);
+void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
 
 /**
  * @warning
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 4cb5560c02..cff0996770 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -17,31 +17,23 @@ extern "C" {
  * This function is not supported on PPC64.
  */
 void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on PPC64.
  */
 void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 }
 
 /**
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index b48a54ec7f..3e224f5ac7 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -31,9 +31,8 @@ __get_umwait_val(const volatile void *p, const uint8_t sz)
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -50,14 +49,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return;
 	}
 	/* execute UMWAIT */
@@ -73,9 +73,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -92,14 +91,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return;
 	}
 	rte_spinlock_unlock(lck);
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v12 04/11] eal: remove sync version of power monitor
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (9 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 03/11] eal: change API of " Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 05/11] eal: add monitor wakeup function Anatoly Burakov
                       ` (6 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the "sync" version of power monitor intrinsic is supposed to
be used for purposes of waking up a sleeping core. However, there are
better ways to achieve the same result, so remove the unneeded function.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../arm/include/rte_power_intrinsics.h        | 12 -----
 .../include/generic/rte_power_intrinsics.h    | 34 --------------
 .../ppc/include/rte_power_intrinsics.h        | 12 -----
 lib/librte_eal/version.map                    |  1 -
 lib/librte_eal/x86/rte_power_intrinsics.c     | 46 -------------------
 5 files changed, 105 deletions(-)

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 76a5fa5234..27869251a8 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -24,18 +24,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	RTE_SET_USED(tsc_timestamp);
 }
 
-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-}
-
 /**
  * This function is not supported on ARM.
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 00c670cb50..a6f1955996 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -57,40 +57,6 @@ __rte_experimental
 void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
-/**
- * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
- *
- * Monitor specific address for changes. This will cause the CPU to enter an
- * architecture-defined optimized power state until either the specified
- * memory address is written to, a certain TSC timestamp is reached, or other
- * reasons cause the CPU to wake up.
- *
- * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
- * mask is non-zero, the current value pointed to by the `p` pointer will be
- * checked against the expected value, and if they match, the entering of
- * optimized power state may be aborted.
- *
- * This call will also lock a spinlock on entering sleep, and release it on
- * waking up the CPU.
- *
- * @warning It is responsibility of the user to check if this function is
- *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *
- * @param pmc
- *   The monitoring condition structure.
- * @param tsc_timestamp
- *   Maximum TSC timestamp to wait for. Note that the wait behavior is
- *   architecture-dependent.
- * @param lck
- *   A spinlock that must be locked before entering the function, will be
- *   unlocked while the CPU is sleeping, and will be locked again once the CPU
- *   wakes up.
- */
-__rte_experimental
-void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
-
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index cff0996770..248d1f4a23 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -24,18 +24,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	RTE_SET_USED(tsc_timestamp);
 }
 
-/**
- * This function is not supported on PPC64.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-}
-
 /**
  * This function is not supported on PPC64.
  */
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 31bf76ae81..20945b1efa 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,7 +406,6 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
-	rte_power_monitor_sync;
 	rte_power_pause;
 };
 
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 3e224f5ac7..a9cd1afe9d 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -67,52 +67,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 			  "a"(tsc_l), "d"(tsc_h));
 }
 
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* prevent user from running this instruction if it's not supported */
-	if (!wait_supported)
-		return;
-
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(pmc->addr));
-
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->data_sz);
-		const uint64_t masked = cur_value & pmc->mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
 /**
  * This function uses TPAUSE instruction  and will enter C0.2 state. For more
  * information about usage of this instruction, please refer to Intel(R) 64 and
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v12 05/11] eal: add monitor wakeup function
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (10 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 04/11] eal: remove sync version of power monitor Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API Anatoly Burakov
                       ` (5 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara

Now that we have everything in a C file, we can store the information
about our sleep, and have a native mechanism to wake up the sleeping
core. This mechanism would however only wake up a core that's sleeping
while monitoring - waking up from `rte_power_pause` won't work.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
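
The wakeup relies on generating a store to the monitored location
without changing what it holds. Below is a minimal standalone sketch of
that idea, assuming GCC/Clang `__atomic` builtins; `sketch_umwait_wakeup`
is a hypothetical name used only for illustration, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the patch's __umwait_wakeup(): touch the
 * monitored location with a store that leaves its contents unchanged.
 */
static inline void
sketch_umwait_wakeup(volatile uint64_t *addr)
{
	uint64_t val;

	assert(addr != NULL);

	/* read the current contents of the monitored location */
	val = __atomic_load_n(addr, __ATOMIC_RELAXED);
	/*
	 * Swap in `val` only if the location still holds `val`. On success
	 * the stored value equals the old one. On failure we store nothing,
	 * but that means another writer has just modified the location.
	 */
	__atomic_compare_exchange_n(addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}
```

Whatever the outcome of the compare-exchange, the sketch never alters the
value seen by the owner of the monitored word, which is why it should be
safe to point it at a live RX descriptor.
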
 .../arm/include/rte_power_intrinsics.h        |  9 +++
 .../include/generic/rte_power_intrinsics.h    | 16 +++++
 .../ppc/include/rte_power_intrinsics.h        |  9 +++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 70 +++++++++++++++++++
 5 files changed, 105 insertions(+)

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 27869251a8..39e49cc45b 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
 	RTE_SET_USED(tsc_timestamp);
 }
 
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index a6f1955996..e311d6f8ea 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -57,6 +57,22 @@ __rte_experimental
 void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Wake up a specific lcore that is in a power optimized state and is monitoring
+ * an address.
+ *
+ * @note This function will *not* wake up a core that is in a power optimized
+ *   state due to calling `rte_power_pause`.
+ *
+ * @param lcore_id
+ *   Lcore ID of a sleeping thread.
+ */
+__rte_experimental
+void rte_power_monitor_wakeup(const unsigned int lcore_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 248d1f4a23..2e7db0e7eb 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
 	RTE_SET_USED(tsc_timestamp);
 }
 
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 20945b1efa..ac026e289d 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,6 +406,7 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
+	rte_power_monitor_wakeup;
 	rte_power_pause;
 };
 
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index a9cd1afe9d..4149b430f6 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -2,8 +2,29 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
+#include <rte_common.h>
+#include <rte_lcore.h>
+#include <rte_spinlock.h>
+
 #include "rte_power_intrinsics.h"
 
+/*
+ * Per-lcore structure holding current status of C0.2 sleeps.
+ */
+static struct power_wait_status {
+	rte_spinlock_t lock;
+	volatile void *monitor_addr; /**< NULL if not currently sleeping */
+} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+
+static inline void
+__umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
 static uint8_t wait_supported;
 
 static inline uint64_t
@@ -36,6 +57,8 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s = &wait_status[lcore_id];
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
@@ -60,11 +83,21 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		if (masked == pmc->val)
 			return;
 	}
+	/* update sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = pmc->addr;
+	rte_spinlock_unlock(&s->lock);
+
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
+
+	/* erase sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = NULL;
+	rte_spinlock_unlock(&s->lock);
 }
 
 /**
@@ -97,3 +130,40 @@ RTE_INIT(rte_power_intrinsics_init) {
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
 }
+
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	struct power_wait_status *s = &wait_status[lcore_id];
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
+	/*
+	 * There is a race condition between sleep, wakeup and locking, but we
+	 * don't need to handle it.
+	 *
+	 * Possible situations:
+	 *
+	 * 1. T1 locks, sets address, unlocks
+	 * 2. T2 locks, triggers wakeup, unlocks
+	 * 3. T1 sleeps
+	 *
+	 * In this case, because T1 has already set the address for monitoring,
+	 * we will wake up immediately even if T2 triggers wakeup before T1
+	 * goes to sleep.
+	 *
+	 * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
+	 * 2. T2 locks, triggers wakeup, and unlocks
+	 * 3. T1 locks, erases address, and unlocks
+	 *
+	 * In this case, since we've already woken up, the "wakeup" was
+	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
+	 * wakeup address is still valid so it's perfectly safe to write it.
+	 */
+	rte_spinlock_lock(&s->lock);
+	if (s->monitor_addr != NULL)
+		__umwait_wakeup(s->monitor_addr);
+	rte_spinlock_unlock(&s->lock);
+}
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (11 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 05/11] eal: add monitor wakeup function Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-28 11:00       ` Andrew Rybchenko
  2021-01-12 20:32       ` Lance Richardson
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 07/11] power: add PMD power management API and callback Anatoly Burakov
                       ` (4 subsequent siblings)
  17 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, konstantin.ananyev, gage.eads,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to allow getting the monitor conditions for
power-optimized monitoring of the RX queues from the PMD, as well as
release notes information.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v6:
    - Rebase on top of latest main
    - Ensure the API checks queue ID (Konstantin)
    - Removed accidental inclusion of unrelated release notes
    v5:
    - Bring function format in line with other functions in the file
    - Ensure the API is supported by the driver before calling it (Konstantin)
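
The order of checks and the error codes of the new call can be modelled
without DPDK. The following is only a sketch under stated assumptions:
every `sketch_*` name is hypothetical, the structures are reduced
stand-ins for the ethdev ones, and the queue ID is checked against the RX
queue count because the driver handler receives an entry of `rx_queues`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical mirror of the monitor condition fields used in this series */
struct sketch_cond {
	volatile void *addr;	/* location to monitor */
	uint64_t val;		/* value that means "do not sleep" */
	uint64_t mask;		/* bits of the location to compare */
	uint8_t data_sz;	/* size of the monitored value in bytes */
};

typedef int (*sketch_get_monitor_addr_t)(void *rxq, struct sketch_cond *pmc);

/* hypothetical, reduced device: one optional handler plus its RX queues */
struct sketch_dev {
	sketch_get_monitor_addr_t get_monitor_addr; /* NULL if unsupported */
	uint16_t nb_rx_queues;
	void **rx_queues;
};

/* hypothetical driver handler: watch bit 0 of a 64-bit descriptor word */
static int
sketch_drv_get_addr(void *rxq, struct sketch_cond *pmc)
{
	assert(rxq != NULL && pmc != NULL);

	pmc->addr = rxq;	/* here the "queue" is just the descriptor */
	pmc->val = 1;
	pmc->mask = 1;
	pmc->data_sz = sizeof(uint64_t);
	return 0;
}

/* same check order as the API: device, driver support, queue ID, pointer */
static int
sketch_get_monitor_addr(struct sketch_dev *dev, uint16_t queue_id,
		struct sketch_cond *pmc)
{
	if (dev == NULL)
		return -ENODEV;
	if (dev->get_monitor_addr == NULL)
		return -ENOTSUP;
	if (queue_id >= dev->nb_rx_queues)
		return -EINVAL;
	if (pmc == NULL)
		return -EINVAL;

	return dev->get_monitor_addr(dev->rx_queues[queue_id], pmc);
}
```

A caller would treat -ENOTSUP as "this PMD cannot be monitored" and fall
back to plain polling, and only pass the filled condition on to
`rte_power_monitor()` when the call returns 0.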

 doc/guides/rel_notes/release_21_02.rst |  4 ++++
 lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
 lib/librte_ethdev/version.map          |  3 +++
 5 files changed, 82 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 638f98168b..feb3ff4f06 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -55,6 +55,10 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: added 1 new API for PMD power management**
+
+  * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
+    ``rte_power_monitor()`` to enable automatic power management for PMD's.
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 17ddacc78d..58f68321ea 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (pmc == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n",
+				pmc);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id],
+			pmc));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index f5f8919186..ca0f91312e 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -4334,6 +4335,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Retrieve the monitor condition for a given receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param pmc
+ *   Pointer to the power-optimized monitoring condition structure.
+ *
+ * @return
+ *   - 0: Success.
+ *   - (-ENOTSUP): Operation not supported.
+ *   - (-EINVAL): Invalid parameters.
+ *   - (-ENODEV): Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index 0eacfd8425..ae4f152cf0 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an RX queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param pmc
+ *   The pointer to power-optimized monitoring condition structure.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_monitor_addr_t)(void *rxq,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -917,6 +937,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_monitor_addr_t get_monitor_addr;
+	/**< Get next RX queue ring entry address. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index d3f5410806..a124e1e370 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -240,6 +240,9 @@ EXPERIMENTAL {
 	rte_flow_get_restore_info;
 	rte_flow_tunnel_action_decap_release;
 	rte_flow_tunnel_item_release;
+
+	# added in 21.02
+	rte_eth_get_monitor_addr;
 };
 
 INTERNAL {
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v12 07/11] power: add PMD power management API and callback
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (12 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 08/11] net/ixgbe: implement power management API Anatoly Burakov
                       ` (3 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas,
	konstantin.ananyev, gage.eads, timothy.mcdaniel,
	bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design uses PMD RX callbacks and supports three power saving schemes:

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. TPAUSE/Pause instruction:

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling:

   Reuse the existing DPDK power library to scale the core frequency
   up/down depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Moved l3fwd-power update to the l3fwd-power-related commit
    - Some rewordings and clarifications
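
All three callbacks share the same empty-poll bookkeeping. Here is a
minimal standalone sketch of it, assuming the same threshold of 512;
`sketch_queue_cfg` and `sketch_poll` are hypothetical names, and the
boolean return merely stands in for "the callback would now sleep, pause
or scale down".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EMPTYPOLL_MAX 512 /* same value as EMPTYPOLL_MAX */

/* hypothetical, reduced version of struct pmd_queue_cfg */
struct sketch_queue_cfg {
	uint64_t empty_poll_stats; /* consecutive empty polls so far */
};

/*
 * Account for one RX burst and report whether the callback would try to
 * save power after this poll.
 */
static bool
sketch_poll(struct sketch_queue_cfg *q, uint16_t nb_rx)
{
	assert(q != NULL);

	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		q->empty_poll_stats = 0;
		return false;
	}

	q->empty_poll_stats++;
	/* strictly greater: the first 512 empty polls never trigger */
	return q->empty_poll_stats > SKETCH_EMPTYPOLL_MAX;
}
```

Note that the counter is not reset after a power-saving action, so once
the threshold is crossed every further empty poll triggers again until a
burst returns at least one packet.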

 doc/guides/prog_guide/power_man.rst    |  44 ++++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 349 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 501 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..02280dd689 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+  * Monitor
+
+    This power saving scheme will put the CPU into an optimized power state and
+    use the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+    descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+    This power saving scheme will avoid busy polling by either entering a
+    power-optimized sleep state with the ``rte_power_pause()`` function or, if
+    it's not available, using ``rte_pause()``.
+
+  * Frequency scaling
+
+    This power saving scheme will use existing ``librte_power`` library
+    functionality to scale the core frequency up/down depending on traffic
+    volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable a specific power scheme for a given queue/port/core
+
+* **Queue Disable**: Disable the power scheme for a given queue/port/core
+
 References
 ----------
 
@@ -200,3 +241,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index feb3ff4f06..bf65000425 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -60,6 +60,16 @@ New Features
   * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
     ``rte_power_monitor()`` to enable automatic power management for PMD's.
 
+* **Add PMD power management helper API**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()``
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 Removed Items
 -------------
 
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..51a471b669 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..96427dffa5
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,349 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+static struct pmd_conf_data {
+	struct rte_cpu_intrinsics intrinsics_support;
+	/**< what do we support? */
+	uint64_t tsc_per_us;
+	/**< pre-calculated tsc diff for 1us */
+	uint64_t pause_per_us;
+	/**< how many rte_pause can we fit in a microsecond? */
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / 1000000; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us */
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			int ret;
+
+			/*
+			 * get address of next descriptor in the RX ring for
+			 * this queue, as well as expected value and a mask.
+			 */
+			ret = rte_eth_get_monitor_addr(port_id, qidx, &pmc);
+			if (ret == 0) {
+				/*
+				 * we might get a cancellation request while
+				 * being inside the callback, in which case the
+				 * wakeup wouldn't work because it would've
+				 * arrived too early.
+				 *
+				 * to get around this, we notify the other
+				 * thread that we're sleeping, so that it can
+				 * spin until we're done. unsolicited wakeups
+				 * are perfectly safe.
+				 */
+				rte_spinlock_lock(&q_conf->umwait_lock);
+				q_conf->umwait_in_progress = true;
+				rte_spinlock_unlock(&q_conf->umwait_lock);
+
+				rte_power_monitor(&pmc, -1ULL);
+
+				rte_spinlock_lock(&q_conf->umwait_lock);
+				q_conf->umwait_in_progress = false;
+				rte_spinlock_unlock(&q_conf->umwait_lock);
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		/* check if rte_power_monitor is supported */
+		struct rte_power_monitor_cond dummy;
+
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			rte_spinlock_lock(&queue_cfg->umwait_lock);
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+			rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..0bfbc6ba69
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..61996b4d11 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1
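
The empty-poll accounting shared by the three callbacks in the patch above
can be sketched in isolation as follows. This is only an illustration, not
the library code: the names prefixed with sketch_ are hypothetical, and the
threshold value is an assumption, since the definition of EMPTYPOLL_MAX is
not shown in this part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold; stands in for EMPTYPOLL_MAX, whose value is not shown. */
#define SKETCH_EMPTYPOLL_MAX 512

/* Hypothetical per-queue state, reduced to the one counter that matters. */
struct sketch_queue_state {
	uint64_t empty_polls; /* consecutive RX bursts that returned nothing */
};

/*
 * Update the counter after one RX burst and report whether the caller
 * should take a power-saving action (monitor, pause or scale down).
 */
static bool
sketch_should_save_power(struct sketch_queue_state *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic arrived: reset the counter and stay awake */
		q->empty_polls = 0;
		return false;
	}

	q->empty_polls++;
	/* act only once the threshold has been exceeded */
	return q->empty_polls > SKETCH_EMPTYPOLL_MAX;
}
```

Because the counter is reset on any non-empty burst, a single packet is
enough to postpone the next power-saving action by a full threshold's worth
of empty polls.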


* [dpdk-dev] [PATCH v12 08/11] net/ixgbe: implement power management API
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (13 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 07/11] power: add PMD power management API and callback Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 09/11] net/i40e: " Anatoly Burakov
                       ` (2 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev,
	gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 9a47a8b262..4b7a5ca60b 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_monitor_addr     = ixgbe_get_monitor_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 6cfbb582e2..7e046a1819 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int
+ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	pmc->data_sz = sizeof(uint32_t);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..8a25e98df6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1
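
To illustrate how the address, value, mask and size filled in by a
get_monitor_addr implementation might be consumed, here is a minimal
standalone sketch. It assumes that the monitoring intrinsic skips sleeping
when the masked value at the address already equals the expected value; the
struct and function names are hypothetical stand-ins, not the real
rte_power_monitor_cond API, and byte-order conversion is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the monitor condition filled in by the driver. */
struct sketch_monitor_cond {
	volatile void *addr; /* address to watch */
	uint64_t val;        /* value expected once the descriptor is written */
	uint64_t mask;       /* bits of interest */
	uint8_t data_sz;     /* width of the watched field: 1, 2, 4 or 8 bytes */
};

/* Read the watched field using the width reported by the driver. */
static uint64_t
sketch_read(const volatile void *addr, uint8_t sz)
{
	switch (sz) {
	case 1:
		return *(const volatile uint8_t *)addr;
	case 2:
		return *(const volatile uint16_t *)addr;
	case 4:
		return *(const volatile uint32_t *)addr;
	default:
		return *(const volatile uint64_t *)addr;
	}
}

/* True if the masked value already matches, so sleeping would be pointless. */
static bool
sketch_already_done(const struct sketch_monitor_cond *c)
{
	return (sketch_read(c->addr, c->data_sz) & c->mask) == c->val;
}
```

With val and mask both set to the DD bit, as in the driver code above, the
check succeeds as soon as the NIC has written the descriptor back, whatever
the other status bits contain.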


* [dpdk-dev] [PATCH v12 09/11] net/i40e: implement power management API
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (14 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 08/11] net/ixgbe: implement power management API Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 10/11] net/ice: " Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev,
	gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index f54769c29d..af2577a140 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_monitor_addr             = i40e_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..0b4220fc9c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,31 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	pmc->data_sz = sizeof(uint64_t);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..e1494525ce 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


* [dpdk-dev] [PATCH v12 10/11] net/ice: implement power management API
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (15 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 09/11] net/i40e: " Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev,
	gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  1 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 9a5d6a559f..c21682c120 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_monitor_addr             = ice_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 5fbd68eafc..fa9e9a235b 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int
+ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	pmc->data_sz = sizeof(uint16_t);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6b16716063..906fbefdc4 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts,
 				  uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1


* [dpdk-dev] [PATCH v12 11/11] examples/l3fwd-power: enable PMD power mgmt
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                       ` (16 preceding siblings ...)
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 10/11] net/ice: " Anatoly Burakov
@ 2020-12-17 14:05     ` Anatoly Burakov
  17 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev, gage.eads,
	timothy.mcdaniel, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v6:
    - Fixed typos in documentation

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 89 ++++++++++++++++++-
 2 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` performs simple L3 forwarding while enabling the power saving scheme on specific
+port/queue/lcore combinations. Its main purpose is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..e312b6f355 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2824,6 +2895,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1
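
As a side note on the --pmd-mgmt argument handling in the patch above, the
mapping from mode name to policy could also be written as a small lookup
table. The sketch below is only one possible shape, not what the patch
does: the enum and function names are hypothetical, and it uses an exact
strcmp() match instead of the strncmp() calls in parse_pmd_mgmt_config().

```c
#include <string.h>

/* Hypothetical mirror of the three policies named in the patch. */
enum sketch_mode {
	SKETCH_MODE_INVALID = 0,
	SKETCH_MODE_MONITOR,
	SKETCH_MODE_PAUSE,
	SKETCH_MODE_SCALE,
};

/* Map a mode name to a policy; anything unknown maps to INVALID. */
static enum sketch_mode
sketch_parse_mode(const char *name)
{
	static const struct {
		const char *name;
		enum sketch_mode mode;
	} table[] = {
		{ "monitor", SKETCH_MODE_MONITOR },
		{ "pause", SKETCH_MODE_PAUSE },
		{ "scale", SKETCH_MODE_SCALE },
	};
	size_t i;

	if (name == NULL)
		return SKETCH_MODE_INVALID;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strcmp(table[i].name, name) == 0)
			return table[i].mode;
	}
	return SKETCH_MODE_INVALID;
}
```

Adding a fourth policy would then mean adding one table row rather than
another comparison block.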


* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 00/11] Add PMD power management Anatoly Burakov
@ 2020-12-17 16:12       ` David Marchand
  2021-01-08 16:42         ` Burakov, Anatoly
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
  1 sibling, 1 reply; 421+ messages in thread
From: David Marchand @ 2020-12-17 16:12 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads,
	Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara,
	Ray Kinsella, Yigit, Ferruh

On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers to cause the
> CPU to enter a power-optimized state while waiting for packets to
> arrive. This is achieved through cooperation with the NIC driver that
> will allow us to know address of wake up event, and wait for writes on
> it.
>
> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
> are used in their raw opcode form because there is no widespread
> compiler support for them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen to implement
> similar instructions.
>
> To achieve power savings, there is a very simple mechanism used: we're
> counting empty polls, and if a certain threshold is reached, we get the
> address of next RX ring descriptor from the NIC driver, arm the
> monitoring hardware, and enter a power-optimized state. We will then
> wake up when either a timeout happens, or a write happens (or generally
> whenever CPU feels like waking up - this is platform-specific), and
> proceed as normal. The empty poll counter is reset whenever we actually
> get packets, so we only go to sleep when we know nothing is going on.
> The mechanism is generic and can be used for any write-back
> descriptor.
>
> This patchset also introduces a few changes into existing power
> management-related intrinsics, namely to provide a native way of waking
> up a sleeping core without application being responsible for it, as well
> as general robustness improvements. There's quite a bit of locking going
> on, but these locks are per-thread and very little (if any) contention
> is expected, so the performance impact shouldn't be that bad (and in any
> case the locking happens when we're about to sleep anyway, not on a
> hotpath).
>
> Why are we putting it into ethdev as opposed to leaving this up to the
> application? Our customers specifically requested a way to do it with
> minimal changes to the application code. The current approach allows
> the application to just flip a switch and automatically have power savings.
>
> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>   must handle RX on at most a single queue
> - Three policy types are supported: monitor, pause and frequency scaling
> - Power management is enabled per-queue
> - The API doesn't extend to other device types

Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this:
https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082

We will have to put an exception on driver only ABI.


-- 
David Marchand



* Re: [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2020-12-28 11:00       ` Andrew Rybchenko
  2021-01-08 16:30         ` Burakov, Anatoly
  2021-01-12 20:32       ` Lance Richardson
  1 sibling, 1 reply; 421+ messages in thread
From: Andrew Rybchenko @ 2020-12-28 11:00 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Ray Kinsella,
	Neil Horman, konstantin.ananyev, gage.eads, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

On 12/17/20 5:05 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple API to allow getting the monitor conditions for
> power-optimized monitoring of the RX queues from the PMD, as well as
> release notes information.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
> 
> Notes:
>     v6:
>     - Rebase on top of latest main
>     - Ensure the API checks queue ID (Konstantin)
>     - Removed accidental inclusion of unrelated release notes
>     v5:
>     - Bring function format in line with other functions in the file
>     - Ensure the API is supported by the driver before calling it (Konstantin)
> 
>  doc/guides/rel_notes/release_21_02.rst |  4 ++++
>  lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
>  lib/librte_ethdev/version.map          |  3 +++
>  5 files changed, 82 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
> index 638f98168b..feb3ff4f06 100644
> --- a/doc/guides/rel_notes/release_21_02.rst
> +++ b/doc/guides/rel_notes/release_21_02.rst
> @@ -55,6 +55,10 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
>  
> +* **ethdev: added 1 new API for PMD power management**

"1 new API" sounds a bit confusing. Maybe just "a new API"?

> +
> +  * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
> +    ``rte_power_monitor()`` to enable automatic power management for PMD's.

Missing extra empty line here.

>  
>  Removed Items
>  -------------
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 17ddacc78d..58f68321ea 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
>  }
>  
> +int
> +rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
> +		struct rte_power_monitor_cond *pmc)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +
> +	dev = &rte_eth_devices[port_id];
> +
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP);
> +
> +	if (queue_id >= dev->data->nb_tx_queues) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid TX queue_id=%u\n", queue_id);

Why is Tx queue checked and logged here, but Rx queue is used
below? I guess Rx should be used here as well.

I.e. TX -> Rx
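
To illustrate the point, here is a minimal standalone sketch of the
check as I would expect it to look. The struct and function names are
hypothetical stand-ins (not the real ethdev types); the only thing that
matters is that the bound being tested is the Rx queue count, since it
is the Rx queue array that gets indexed afterwards.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant fields of dev->data. */
struct dev_data {
	uint16_t nb_rx_queues;
	uint16_t nb_tx_queues;
	void **rx_queues;
};

/*
 * Look up an Rx queue pointer by index. The index is validated against
 * nb_rx_queues because rx_queues[] is the array being indexed.
 */
static int
get_rx_queue(const struct dev_data *data, uint16_t queue_id, void **rxq)
{
	if (queue_id >= data->nb_rx_queues)
		return -EINVAL;

	*rxq = data->rx_queues[queue_id];
	return 0;
}
```

With 2 Rx queues and 4 Tx queues, queue_id 2 is rejected here, whereas
a check against the Tx count would let it through and read past the end
of the Rx queue array.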

> +		return -EINVAL;
> +	}
> +
> +	if (pmc == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n",
> +				pmc);
> +		return -EINVAL;
> +	}
> +
> +	return eth_err(port_id,
> +		dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id],
> +			pmc));
> +}
> +
>  int
>  rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>  			     struct rte_ether_addr *mc_addr_set,
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index f5f8919186..ca0f91312e 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -157,6 +157,7 @@ extern "C" {
>  #include <rte_common.h>
>  #include <rte_config.h>
>  #include <rte_ether.h>
> +#include <rte_power_intrinsics.h>
>  
>  #include "rte_ethdev_trace_fp.h"
>  #include "rte_dev_info.h"
> @@ -4334,6 +4335,30 @@ __rte_experimental
>  int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  	struct rte_eth_burst_mode *mode);
>  
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Retrieve the monitor condition for a given receive queue.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Rx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param pmc
> + *   The pointer point to power-optimized monitoring condition structure.
> + *
> + * @return
> + *   - 0: Success.
> + *   -ENOTSUP: Operation not supported.
> + *   -EINVAL: Invalid parameters.
> + *   -ENODEV: Invalid port ID.
> + */
> +__rte_experimental
> +int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
> +		struct rte_power_monitor_cond *pmc);
> +
>  /**
>   * Retrieve device registers and register attributes (number of registers and
>   * register size)
> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
> index 0eacfd8425..ae4f152cf0 100644
> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> @@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
>  	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
>  /**< @internal Unbind peer queue from the current queue. */
>  
> +/**
> + * @internal
> + * Get address of memory location whose contents will change whenever there is
> + * new data to be received on an RX queue.

RX -> Rx

> + *
> + * @param rxq
> + *   Ethdev queue pointer.
> + * @param pmc
> + *   The pointer to power-optimized monitoring condition structure.
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success
> + * @retval -EINVAL
> + *   Invalid parameters
> + */
> +typedef int (*eth_get_monitor_addr_t)(void *rxq,
> +		struct rte_power_monitor_cond *pmc);
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -917,6 +937,8 @@ struct eth_dev_ops {
>  	/**< Set up the connection between the pair of hairpin queues. */
>  	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
>  	/**< Disconnect the hairpin queues of a pair from each other. */
> +	eth_get_monitor_addr_t get_monitor_addr;
> +	/**< Get next RX queue ring entry address. */

RX -> Rx

>  };
>  
>  /**
> diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
> index d3f5410806..a124e1e370 100644
> --- a/lib/librte_ethdev/version.map
> +++ b/lib/librte_ethdev/version.map
> @@ -240,6 +240,9 @@ EXPERIMENTAL {
>  	rte_flow_get_restore_info;
>  	rte_flow_tunnel_action_decap_release;
>  	rte_flow_tunnel_item_release;
> +
> +	# added in 21.02
> +	rte_eth_get_monitor_addr;
>  };
>  
>  INTERNAL {
> 


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API
  2020-12-28 11:00       ` Andrew Rybchenko
@ 2021-01-08 16:30         ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-08 16:30 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Ray Kinsella,
	Neil Horman, konstantin.ananyev, gage.eads, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

On 28-Dec-20 11:00 AM, Andrew Rybchenko wrote:
> On 12/17/20 5:05 PM, Anatoly Burakov wrote:
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add a simple API to allow getting the monitor conditions for
>> power-optimized monitoring of the RX queues from the PMD, as well as
>> release notes information.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> ---

Hi Andrew,

Thanks for your review!

>> @@ -55,6 +55,10 @@ New Features
>>        Also, make sure to start the actual text at the margin.
>>        =======================================================
>>   
>> +* **ethdev: added 1 new API for PMD power management**
> 
> "1 new API" sounds a bit confusing. May be just "a new API"?
> 
>> +
>> +  * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
>> +    ``rte_power_monitor()`` to enable automatic power management for PMD's.
> 
> Missing extra empty line here.
> 

Will fix.

>>   
>>   Removed Items
>>   -------------
>> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
>> index 17ddacc78d..58f68321ea 100644
>> --- a/lib/librte_ethdev/rte_ethdev.c
>> +++ b/lib/librte_ethdev/rte_ethdev.c
>> @@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>>   		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
>>   }
>>   
>> +int
>> +rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
>> +		struct rte_power_monitor_cond *pmc)
>> +{
>> +	struct rte_eth_dev *dev;
>> +
>> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
>> +
>> +	dev = &rte_eth_devices[port_id];
>> +
>> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP);
>> +
>> +	if (queue_id >= dev->data->nb_tx_queues) {
>> +		RTE_ETHDEV_LOG(ERR, "Invalid TX queue_id=%u\n", queue_id);
> 
> Why is Tx queue checked and logged here, but Rx queue is used
> below? I guess Rx should be used here as well.
> 
> I.e. TX -> Rx
> 

Yep, fixed already.

>> +		return -EINVAL;
>> +	}
>> +
>> +	if (pmc == NULL) {
>> +		RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n",
>> +				pmc);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return eth_err(port_id,
>> +		dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id],
>> +			pmc));
>> +}
>> +
>>   int
>>   rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>>   			     struct rte_ether_addr *mc_addr_set,
>> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
>> index f5f8919186..ca0f91312e 100644
>> --- a/lib/librte_ethdev/rte_ethdev.h
>> +++ b/lib/librte_ethdev/rte_ethdev.h
>> @@ -157,6 +157,7 @@ extern "C" {
>>   #include <rte_common.h>
>>   #include <rte_config.h>
>>   #include <rte_ether.h>
>> +#include <rte_power_intrinsics.h>
>>   
>>   #include "rte_ethdev_trace_fp.h"
>>   #include "rte_dev_info.h"
>> @@ -4334,6 +4335,30 @@ __rte_experimental
>>   int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>>   	struct rte_eth_burst_mode *mode);
>>   
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Retrieve the monitor condition for a given receive queue.
>> + *
>> + * @param port_id
>> + *   The port identifier of the Ethernet device.
>> + * @param queue_id
>> + *   The Rx queue on the Ethernet device for which information
>> + *   will be retrieved.
>> + * @param pmc
>> + *   The pointer point to power-optimized monitoring condition structure.
>> + *
>> + * @return
>> + *   - 0: Success.
>> + *   -ENOTSUP: Operation not supported.
>> + *   -EINVAL: Invalid parameters.
>> + *   -ENODEV: Invalid port ID.
>> + */
>> +__rte_experimental
>> +int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
>> +		struct rte_power_monitor_cond *pmc);
>> +
>>   /**
>>    * Retrieve device registers and register attributes (number of registers and
>>    * register size)
>> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
>> index 0eacfd8425..ae4f152cf0 100644
>> --- a/lib/librte_ethdev/rte_ethdev_driver.h
>> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
>> @@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
>>   	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
>>   /**< @internal Unbind peer queue from the current queue. */
>>   
>> +/**
>> + * @internal
>> + * Get address of memory location whose contents will change whenever there is
>> + * new data to be received on an RX queue.
> 
> RX -> Rx
> 

Will fix.

>> + *
>> + * @param rxq
>> + *   Ethdev queue pointer.
>> + * @param pmc
>> + *   The pointer to power-optimized monitoring condition structure.
>> + * @return
>> + *   Negative errno value on error, 0 on success.
>> + *
>> + * @retval 0
>> + *   Success
>> + * @retval -EINVAL
>> + *   Invalid parameters
>> + */
>> +typedef int (*eth_get_monitor_addr_t)(void *rxq,
>> +		struct rte_power_monitor_cond *pmc);
>> +
>>   /**
>>    * @internal A structure containing the functions exported by an Ethernet driver.
>>    */
>> @@ -917,6 +937,8 @@ struct eth_dev_ops {
>>   	/**< Set up the connection between the pair of hairpin queues. */
>>   	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
>>   	/**< Disconnect the hairpin queues of a pair from each other. */
>> +	eth_get_monitor_addr_t get_monitor_addr;
>> +	/**< Get next RX queue ring entry address. */
> 
> RX -> Rx
> 

Will fix.

>>   };
>>   
>>   /**
>> diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
>> index d3f5410806..a124e1e370 100644
>> --- a/lib/librte_ethdev/version.map
>> +++ b/lib/librte_ethdev/version.map
>> @@ -240,6 +240,9 @@ EXPERIMENTAL {
>>   	rte_flow_get_restore_info;
>>   	rte_flow_tunnel_action_decap_release;
>>   	rte_flow_tunnel_item_release;
>> +
>> +	# added in 21.02
>> +	rte_eth_get_monitor_addr;
>>   };
>>   
>>   INTERNAL {
>>
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management
  2020-12-17 16:12       ` David Marchand
@ 2021-01-08 16:42         ` Burakov, Anatoly
  2021-01-11  8:44           ` David Marchand
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-08 16:42 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads,
	Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara,
	Ray Kinsella, Yigit, Ferruh

On 17-Dec-20 4:12 PM, David Marchand wrote:
> On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> This patchset proposes a simple API for Ethernet drivers to cause the
>> CPU to enter a power-optimized state while waiting for packets to
>> arrive. This is achieved through cooperation with the NIC driver that
>> will allow us to know address of wake up event, and wait for writes on
>> it.
>>
>> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
>> are used in their raw opcode form because there is no widespread
>> compiler support for them yet. Still, the API is made generic enough to
>> hopefully support other architectures, if they happen to implement
>> similar instructions.
>>
>> To achieve power savings, there is a very simple mechanism used: we're
>> counting empty polls, and if a certain threshold is reached, we get the
>> address of next RX ring descriptor from the NIC driver, arm the
>> monitoring hardware, and enter a power-optimized state. We will then
>> wake up when either a timeout happens, or a write happens (or generally
>> whenever CPU feels like waking up - this is platform-specific), and
>> proceed as normal. The empty poll counter is reset whenever we actually
>> get packets, so we only go to sleep when we know nothing is going on.
>> The mechanism is generic which can be used for any write back
>> descriptor.
>>
>> This patchset also introduces a few changes into existing power
>> management-related intrinsics, namely to provide a native way of waking
>> up a sleeping core without application being responsible for it, as well
>> as general robustness improvements. There's quite a bit of locking going
>> on, but these locks are per-thread and very little (if any) contention
>> is expected, so the performance impact shouldn't be that bad (and in any
>> case the locking happens when we're about to sleep anyway, not on a
>> hotpath).
>>
>> Why are we putting it into ethdev as opposed to leaving this up to the
>> application? Our customers specifically requested a way to do it wit
>> minimal changes to the application code. The current approach allows to
>> just flip a switch and automatically have power savings.
>>
>> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>>    must at most handle RX on a single queue
>> - Support 3 type policies. Monitor/Pause/Frequency Scaling
>> - Power management is enabled per-queue
>> - The API doesn't extend to other device types
> 
> Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this:
> https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082
> 
> We will have to put an exception on driver only ABI.
> 
> 

Why does aarch64 build fail there? The functions in question are in the 
version map file, but the build complains that they aren't.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v13 00/11] Add PMD power management
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 00/11] Add PMD power management Anatoly Burakov
  2020-12-17 16:12       ` David Marchand
@ 2021-01-08 17:42       ` Anatoly Burakov
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov
                           ` (11 more replies)
  1 sibling, 12 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver that will allow us to know the address of the
wake-up event, and wait for writes on that address.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They 
are used in their raw opcode form because there is no widespread 
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement 
similar instructions.
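
As a rough illustration of the check performed between arming the
monitor and actually waiting (see the x86 rte_power_monitor() in patch
01/11), the sketch below reads the monitored location at the given
width, applies the mask, and reports whether the expected value is
already present, in which case the wait is skipped. The function names
are made up for this example, and unlike the patch, an unsupported size
simply reads as 0 here instead of asserting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Read the monitored location using the access width given by sz. */
static uint64_t
read_monitored(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		return 0; /* simplification: the patch asserts here */
	}
}

/*
 * Return true if the masked value already equals the expected value,
 * i.e. the caller should not go to sleep. A zero mask disables the check.
 */
static bool
wake_condition_met(const volatile void *p, uint64_t expected,
		uint64_t mask, uint8_t sz)
{
	if (mask == 0)
		return false;

	return (read_monitored(p, sz) & mask) == expected;
}
```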

To achieve power savings, there is a very simple mechanism used: we're
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
an Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.
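
The empty poll counting described above can be sketched roughly as
follows. This is only an illustration of the idea, not the code from
librte_power: the threshold value, the struct, and the function name
are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the real value is not stated in this letter. */
#define EMPTY_POLL_THRESHOLD 512

/* Hypothetical per-queue state, not the struct used by librte_power. */
struct queue_pm_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Update the counter after one Rx burst and report whether the caller
 * should now apply its power saving scheme (monitor, pause or scaling).
 */
static bool
pm_should_save_power(struct queue_pm_state *st, uint16_t nb_rx)
{
	if (nb_rx > 0) {
		/* traffic is back: reset the counter and keep polling */
		st->empty_polls = 0;
		return false;
	}

	st->empty_polls++;
	return st->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

An Rx callback would call this with the number of packets returned by
the burst and only enter a power-optimized state when it returns true.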

This patchset also introduces a few changes into existing power 
management-related intrinsics, namely to provide a native way of waking 
up a sleeping core without application being responsible for it, as well 
as general robustness improvements. There's quite a bit of locking going 
on, but these locks are per-thread and very little (if any) contention 
is expected, so the performance impact shouldn't be that bad (and in any 
case the locking happens when we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows the
user to just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle RX on at most one queue
- Three policy types are supported: Monitor, Pause, and Frequency Scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  44 +++
 doc/guides/rel_notes/release_21_02.rst        |  15 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |  10 +-
 drivers/event/dlb2/dlb2.c                     |  10 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  25 ++
 drivers/net/i40e/i40e_rxtx.h                  |   1 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   1 +
 examples/l3fwd-power/main.c                   |  89 ++++-
 .../arm/include/rte_power_intrinsics.h        |  39 +-
 .../include/generic/rte_power_intrinsics.h    |  78 ++--
 .../ppc/include/rte_power_intrinsics.h        |  39 +-
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 ------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 184 +++++++++
 lib/librte_ethdev/rte_ethdev.c                |  28 ++
 lib/librte_ethdev/rte_ethdev.h                |  25 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  22 ++
 lib/librte_ethdev/version.map                 |   3 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 360 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  90 +++++
 lib/librte_power/version.map                  |   5 +
 30 files changed, 1055 insertions(+), 229 deletions(-)
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-12 15:54           ` Ananyev, Konstantin
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov
                           ` (10 subsequent siblings)
  11 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, power intrinsics are inline functions. Make them part of the
ABI so that we can have various internal data associated with them
without exposing said data to the outside world.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../arm/include/rte_power_intrinsics.h        |   6 +-
 .../include/generic/rte_power_intrinsics.h    |   6 +-
 .../ppc/include/rte_power_intrinsics.h        |   6 +-
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 -----------------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 120 ++++++++++++++++++
 7 files changed, 135 insertions(+), 124 deletions(-)
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index a4a1bc1159..5e384d380e 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -16,7 +16,7 @@ extern "C" {
 /**
  * This function is not supported on ARM.
  */
-static inline void
+void
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
@@ -31,7 +31,7 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 /**
  * This function is not supported on ARM.
  */
-static inline void
+void
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
@@ -47,7 +47,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 /**
  * This function is not supported on ARM.
  */
-static inline void
+void
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index dd520d90fa..67977bd511 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -52,7 +52,7 @@
  *   to undefined result.
  */
 __rte_experimental
-static inline void rte_power_monitor(const volatile void *p,
+void rte_power_monitor(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz);
 
@@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p,
  *   wakes up.
  */
 __rte_experimental
-static inline void rte_power_monitor_sync(const volatile void *p,
+void rte_power_monitor_sync(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz,
 		rte_spinlock_t *lck);
@@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  *   architecture-dependent.
  */
 __rte_experimental
-static inline void rte_power_pause(const uint64_t tsc_timestamp);
+void rte_power_pause(const uint64_t tsc_timestamp);
 
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 4ed03d521f..4cb5560c02 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -16,7 +16,7 @@ extern "C" {
 /**
  * This function is not supported on PPC64.
  */
-static inline void
+void
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
@@ -31,7 +31,7 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 /**
  * This function is not supported on PPC64.
  */
-static inline void
+void
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
@@ -47,7 +47,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 /**
  * This function is not supported on PPC64.
  */
-static inline void
+void
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 354c068f31..31bf76ae81 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -403,6 +403,11 @@ EXPERIMENTAL {
 	rte_service_lcore_may_be_active;
 	rte_vect_get_max_simd_bitwidth;
 	rte_vect_set_max_simd_bitwidth;
+
+	# added in 21.02
+	rte_power_monitor;
+	rte_power_monitor_sync;
+	rte_power_pause;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
index c7d790c854..e4c2b87f73 100644
--- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -13,121 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-static inline uint64_t
-__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz)
-{
-	switch (sz) {
-	case sizeof(uint8_t):
-		return *(const volatile uint8_t *)p;
-	case sizeof(uint16_t):
-		return *(const volatile uint16_t *)p;
-	case sizeof(uint32_t):
-		return *(const volatile uint32_t *)p;
-	case sizeof(uint64_t):
-		return *(const volatile uint64_t *)p;
-	default:
-		/* this is an intrinsic, so we can't have any error handling */
-		RTE_ASSERT(0);
-		return 0;
-	}
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
-/**
- * This function uses TPAUSE instruction  and will enter C0.2 state. For more
- * information about usage of this instruction, please refer to Intel(R) 64 and
- * IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* execute TPAUSE */
-	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-		: /* ignore rflags */
-		: "D"(0), /* enter C0.2 */
-		  "a"(tsc_l), "d"(tsc_h));
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build
index e78f29002e..dfd42dee0c 100644
--- a/lib/librte_eal/x86/meson.build
+++ b/lib/librte_eal/x86/meson.build
@@ -8,4 +8,5 @@ sources += files(
 	'rte_cycles.c',
 	'rte_hypervisor.c',
 	'rte_spinlock.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
new file mode 100644
index 0000000000..34c5fd9c3e
--- /dev/null
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses TPAUSE instruction  and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			"a"(tsc_l), "d"(tsc_h));
+}
-- 
2.25.1

* [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in power intrinsics
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-08 19:58           ` Stephen Hemminger
  2021-01-12 15:56           ` Ananyev, Konstantin
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 03/11] eal: change API of " Anatoly Burakov
                           ` (9 subsequent siblings)
  11 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Konstantin Ananyev, thomas, gage.eads,
	timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the API documentation mandates that if the user wants to use
the power management intrinsics, they need to call the
`rte_cpu_get_intrinsics_support` API and check support for specific
intrinsics.

However, if the user does not do that, it is possible to get an illegal
instruction error because we're using raw instruction opcodes, which may
or may not be supported at runtime.

Now that we have everything in a C file, we can check for support at
startup and prevent the user from possibly encountering illegal
instruction errors.
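
As a rough illustration of the startup check described above, here is a
minimal sketch that does not depend on DPDK. All `demo_*` names are
hypothetical. It assumes a GCC- or Clang-compatible compiler (for
`__attribute__((constructor))` and `<cpuid.h>`) and reads the WAITPKG
feature bit directly instead of calling `rte_cpu_get_intrinsics_support`.
Unlike the patch, which returns silently, the sketch returns an error code
so the effect of the guard is visible.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Written once at load time, read-only afterwards. */
static bool demo_wait_supported;

/* Runs before main(), similar in spirit to an RTE_INIT constructor. */
__attribute__((constructor))
static void demo_power_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* WAITPKG (UMONITOR/UMWAIT/TPAUSE) is CPUID.(EAX=7,ECX=0):ECX[5] */
	if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		demo_wait_supported = (ecx >> 5) & 1;
#endif
}

/* Hypothetical wrapper: refuses to run when the feature is absent. */
static int demo_power_pause(uint64_t tsc_timestamp)
{
	if (!demo_wait_supported)
		return -ENOTSUP;

	/* a real implementation would execute TPAUSE here */
	(void)tsc_timestamp;
	return 0;
}
```

Because the flag is set in a constructor, callers never see it in an
uninitialized state, and the per-call cost is a single load and branch.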

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  3 --
 lib/librte_eal/x86/rte_power_intrinsics.c     | 31 +++++++++++++++++--
 2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 67977bd511..ffa72f7578 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -34,7 +34,6 @@
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 34c5fd9c3e..b48a54ec7f 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,8 @@
 
 #include "rte_power_intrinsics.h"
 
+static uint8_t wait_supported;
+
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
 {
@@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp)
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/* execute TPAUSE */
 	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			"a"(tsc_l), "d"(tsc_h));
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+RTE_INIT(rte_power_intrinsics_init) {
+	struct rte_cpu_intrinsics i;
+
+	rte_cpu_get_intrinsics_support(&i);
+
+	if (i.power_monitor && i.power_pause)
+		wait_supported = 1;
 }
-- 
2.25.1

* [dpdk-dev] [PATCH v13 03/11] eal: change API of power intrinsics
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-12 15:58           ` Ananyev, Konstantin
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor Anatoly Burakov
                           ` (8 subsequent siblings)
  11 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev, thomas,
	gage.eads, david.hunt, chris.macnamara

Instead of passing around pointers and integers, collect everything
into a struct. This makes API design around these intrinsics much easier.
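
To show what the struct-based condition looks like from a caller's point
of view, here is a minimal sketch that does not depend on DPDK. The
`demo_*` names are hypothetical stand-ins that mirror the fields of
`struct rte_power_monitor_cond`; the check follows the same logic as the
x86 implementation in this patch (read by size, mask, compare), but it is
an illustration rather than the library code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_power_monitor_cond. */
struct demo_monitor_cond {
	volatile void *addr; /* address to monitor for changes */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* zero disables the pre-sleep check */
	uint8_t data_sz;     /* 1, 2, 4 or 8 bytes */
};

/* Read data_sz bytes from p and widen the result to 64 bits. */
static uint64_t demo_read(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		assert(0 && "unsupported data size");
		return 0;
	}
}

/* True when the sleep should be skipped: the masked value already matches. */
static bool demo_cond_already_met(const struct demo_monitor_cond *c)
{
	if (c->mask == 0)
		return false;
	return (demo_read(c->addr, c->data_sz) & c->mask) == c->val;
}
```

Bundling the four values means a driver can fill in the condition once and
hand it to a generic layer, which is harder to do with a long parameter
list.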

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/event/dlb/dlb.c                       | 10 ++--
 drivers/event/dlb2/dlb2.c                     | 10 ++--
 .../arm/include/rte_power_intrinsics.h        | 20 +++-----
 .../include/generic/rte_power_intrinsics.h    | 49 ++++++++-----------
 .../ppc/include/rte_power_intrinsics.h        | 20 +++-----
 lib/librte_eal/x86/rte_power_intrinsics.c     | 32 ++++++------
 6 files changed, 62 insertions(+), 79 deletions(-)

diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c
index 0c95c4793d..d2f2026291 100644
--- a/drivers/event/dlb/dlb.c
+++ b/drivers/event/dlb/dlb.c
@@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		/* Interrupts not supported by PF PMD */
 		return 1;
 	} else if (dlb->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 86724863f2..c9a8a02278 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 	if (elapsed_ticks >= timeout) {
 		return 1;
 	} else if (dlb2->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb2_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 5e384d380e..76a5fa5234 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -17,31 +17,23 @@ extern "C" {
  * This function is not supported on ARM.
  */
 void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on ARM.
  */
 void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 }
 
 /**
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index ffa72f7578..00c670cb50 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -18,6 +18,18 @@
  * which are architecture-dependent.
  */
 
+struct rte_power_monitor_cond {
+	volatile void *addr;  /**< Address to monitor for changes */
+	uint64_t val;         /**< Before attempting the monitoring, the address
+	                       *   may be read and compared against this value.
+	                       **/
+	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
+	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
+	                  *   expected value with the memory address. Can be 1,
+	                  *   2, 4, or 8. Supplying any other value will lead to
+	                  *   undefined result. */
+};
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
@@ -35,25 +47,15 @@
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  */
 __rte_experimental
-void rte_power_monitor(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz);
+void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp);
 
 /**
  * @warning
@@ -75,30 +77,19 @@ void rte_power_monitor(const volatile void *p,
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  * @param lck
  *   A spinlock that must be locked before entering the function, will be
  *   unlocked while the CPU is sleeping, and will be locked again once the CPU
  *   wakes up.
  */
 __rte_experimental
-void rte_power_monitor_sync(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz,
-		rte_spinlock_t *lck);
+void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
 
 /**
  * @warning
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 4cb5560c02..cff0996770 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -17,31 +17,23 @@ extern "C" {
  * This function is not supported on PPC64.
  */
 void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on PPC64.
  */
 void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 }
 
 /**
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index b48a54ec7f..3e224f5ac7 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -31,9 +31,8 @@ __get_umwait_val(const volatile void *p, const uint8_t sz)
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -50,14 +49,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return;
 	}
 	/* execute UMWAIT */
@@ -73,9 +73,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -92,14 +91,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return;
 	}
 	rte_spinlock_unlock(lck);
-- 
2.25.1

* [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (2 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-12 15:59           ` Ananyev, Konstantin
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function Anatoly Burakov
                           ` (7 subsequent siblings)
  11 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the "sync" version of the power monitor intrinsic is supposed
to be used for purposes of waking up a sleeping core. However, there are
better ways to achieve the same result, so remove the unneeded function.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../arm/include/rte_power_intrinsics.h        | 12 -----
 .../include/generic/rte_power_intrinsics.h    | 34 --------------
 .../ppc/include/rte_power_intrinsics.h        | 12 -----
 lib/librte_eal/version.map                    |  1 -
 lib/librte_eal/x86/rte_power_intrinsics.c     | 46 -------------------
 5 files changed, 105 deletions(-)

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 76a5fa5234..27869251a8 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -24,18 +24,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	RTE_SET_USED(tsc_timestamp);
 }
 
-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-}
-
 /**
  * This function is not supported on ARM.
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 00c670cb50..a6f1955996 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -57,40 +57,6 @@ __rte_experimental
 void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
-/**
- * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
- *
- * Monitor specific address for changes. This will cause the CPU to enter an
- * architecture-defined optimized power state until either the specified
- * memory address is written to, a certain TSC timestamp is reached, or other
- * reasons cause the CPU to wake up.
- *
- * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
- * mask is non-zero, the current value pointed to by the `p` pointer will be
- * checked against the expected value, and if they match, the entering of
- * optimized power state may be aborted.
- *
- * This call will also lock a spinlock on entering sleep, and release it on
- * waking up the CPU.
- *
- * @warning It is responsibility of the user to check if this function is
- *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *
- * @param pmc
- *   The monitoring condition structure.
- * @param tsc_timestamp
- *   Maximum TSC timestamp to wait for. Note that the wait behavior is
- *   architecture-dependent.
- * @param lck
- *   A spinlock that must be locked before entering the function, will be
- *   unlocked while the CPU is sleeping, and will be locked again once the CPU
- *   wakes up.
- */
-__rte_experimental
-void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
-
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index cff0996770..248d1f4a23 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -24,18 +24,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	RTE_SET_USED(tsc_timestamp);
 }
 
-/**
- * This function is not supported on PPC64.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-}
-
 /**
  * This function is not supported on PPC64.
  */
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 31bf76ae81..20945b1efa 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,7 +406,6 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
-	rte_power_monitor_sync;
 	rte_power_pause;
 };
 
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 3e224f5ac7..a9cd1afe9d 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -67,52 +67,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 			  "a"(tsc_l), "d"(tsc_h));
 }
 
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* prevent user from running this instruction if it's not supported */
-	if (!wait_supported)
-		return;
-
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(pmc->addr));
-
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->data_sz);
-		const uint64_t masked = cur_value & pmc->mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
 /**
  * This function uses TPAUSE instruction  and will enter C0.2 state. For more
  * information about usage of this instruction, please refer to Intel(R) 64 and
-- 
2.25.1

* [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (3 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-12 16:02           ` Ananyev, Konstantin
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API Anatoly Burakov
                           ` (6 subsequent siblings)
  11 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara

Now that we have everything in a C file, we can store the information
about our sleep, and have a native mechanism to wake up the sleeping
core. However, this mechanism will only wake up a core that's sleeping
while monitoring; waking up from `rte_power_pause` won't work.
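
The wakeup idea can be sketched without DPDK as follows. The `demo_*`
names and `DEMO_MAX_LCORE` are hypothetical; the sketch assumes GCC or
Clang `__atomic` builtins and an 8-byte-aligned monitored address. It
leaves out the per-lcore spinlock that the patch uses to protect the
stored address, so it only shows the bounds check and the write that
leaves the value unchanged.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_LCORE 128 /* hypothetical stand-in for RTE_MAX_LCORE */

/* Per-core record of the monitored address; NULL means "not sleeping". */
static volatile void *demo_monitor_addr[DEMO_MAX_LCORE];

/* Issue a write to addr without changing what is stored there. */
static void demo_touch(volatile void *addr)
{
	uint64_t val;

	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
	/* swap the value with itself; a failed compare leaves memory as-is */
	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/*
 * Returns -1 for an out-of-range core ID, 0 if the core is not sleeping,
 * and 1 if a wakeup write was issued.
 */
static int demo_wakeup(unsigned int lcore_id)
{
	volatile void *addr;

	/* reject IDs that would index past the end of the table */
	if (lcore_id >= DEMO_MAX_LCORE)
		return -1;

	addr = demo_monitor_addr[lcore_id];
	if (addr == NULL)
		return 0;

	demo_touch(addr);
	return 1;
}
```

The compare-and-swap is used because the waker must not disturb the data
being monitored (for example, a descriptor the NIC is about to write). On
x86 the locked compare-exchange performs a write cycle regardless of the
comparison result, which is what the armed monitor reacts to.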

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v13:
    - Add comments around wakeup code to explain what it does
    - Add lcore_id parameter checking to prevent buffer overrun

 .../arm/include/rte_power_intrinsics.h        |  9 ++
 .../include/generic/rte_power_intrinsics.h    | 16 ++++
 .../ppc/include/rte_power_intrinsics.h        |  9 ++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 85 +++++++++++++++++++
 5 files changed, 120 insertions(+)

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 27869251a8..39e49cc45b 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
 	RTE_SET_USED(tsc_timestamp);
 }
 
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index a6f1955996..e311d6f8ea 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -57,6 +57,22 @@ __rte_experimental
 void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Wake up a specific lcore that is in a power optimized state and is monitoring
+ * an address.
+ *
+ * @note This function will *not* wake up a core that is in a power optimized
+ *   state due to calling `rte_power_pause`.
+ *
+ * @param lcore_id
+ *   Lcore ID of a sleeping thread.
+ */
+__rte_experimental
+void rte_power_monitor_wakeup(const unsigned int lcore_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 248d1f4a23..2e7db0e7eb 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
 	RTE_SET_USED(tsc_timestamp);
 }
 
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 20945b1efa..ac026e289d 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,6 +406,7 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
+	rte_power_monitor_wakeup;
 	rte_power_pause;
 };
 
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index a9cd1afe9d..46a4fb6cd5 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -2,8 +2,31 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
+#include <rte_common.h>
+#include <rte_lcore.h>
+#include <rte_spinlock.h>
+
 #include "rte_power_intrinsics.h"
 
+/*
+ * Per-lcore structure holding current status of C0.2 sleeps.
+ */
+static struct power_wait_status {
+	rte_spinlock_t lock;
+	volatile void *monitor_addr; /**< NULL if not currently sleeping */
+} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+
+static inline void
+__umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	/* trigger a write but don't change the value */
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
 static uint8_t wait_supported;
 
 static inline uint64_t
@@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s;
+
+	/* prevent non-EAL thread from using this API */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
@@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		if (masked == pmc->val)
 			return;
 	}
+
+	s = &wait_status[lcore_id];
+
+	/* update sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = pmc->addr;
+	rte_spinlock_unlock(&s->lock);
+
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
+
+	/* erase sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = NULL;
+	rte_spinlock_unlock(&s->lock);
 }
 
 /**
@@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) {
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
 }
+
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	struct power_wait_status *s;
+
+	/* prevent buffer overrun */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
+	s = &wait_status[lcore_id];
+
+	/*
+	 * There is a race condition between sleep, wakeup and locking, but we
+	 * don't need to handle it.
+	 *
+	 * Possible situations:
+	 *
+	 * 1. T1 locks, sets address, unlocks
+	 * 2. T2 locks, triggers wakeup, unlocks
+	 * 3. T1 sleeps
+	 *
+	 * In this case, because T1 has already set the address for monitoring,
+	 * we will wake up immediately even if T2 triggers wakeup before T1
+	 * goes to sleep.
+	 *
+	 * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
+	 * 2. T2 locks, triggers wakeup, and unlocks
+	 * 3. T1 locks, erases address, and unlocks
+	 *
+	 * In this case, since we've already woken up, the "wakeup" was
+	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
+	 * wakeup address is still valid so it's perfectly safe to write it.
+	 */
+	rte_spinlock_lock(&s->lock);
+	if (s->monitor_addr != NULL)
+		__umwait_wakeup(s->monitor_addr);
+	rte_spinlock_unlock(&s->lock);
+}
-- 
2.25.1
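
For readers who want to try the wakeup trick from __umwait_wakeup() outside
of DPDK, here is a minimal standalone sketch. It assumes a GCC/Clang-style
compiler that provides the __atomic builtins; the name touch_without_change()
is made up for illustration and is not part of the patch. The idea is that a
compare-exchange of a value with itself issues a store to the monitored
address while leaving its contents as they were.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Issue a store to *addr without changing its contents, in the manner of
 * __umwait_wakeup(). Returns true if the compare-exchange succeeded.
 * (Hypothetical helper, for illustration only.)
 */
static bool
touch_without_change(volatile uint64_t *addr)
{
	uint64_t val;

	assert(addr != NULL);

	/* read the current value... */
	val = __atomic_load_n(addr, __ATOMIC_RELAXED);
	/* ...and swap it with itself, so the contents stay the same */
	return __atomic_compare_exchange_n(addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}
```

If another writer changes the word between the load and the exchange, the
exchange fails, but in that case the other writer's store has already touched
the monitored address, so a failure should be harmless for wakeup purposes.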

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (4 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-09  8:04           ` Andrew Rybchenko
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 07/11] power: add PMD power management API and callback Anatoly Burakov
                           ` (5 subsequent siblings)
  11 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, konstantin.ananyev, gage.eads,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to allow getting the monitor conditions for
power-optimized monitoring of the RX queues from the PMD, as well as
release notes information.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v13:
    - Fix typos and issues raised by Andrew

 doc/guides/rel_notes/release_21_02.rst |  5 +++++
 lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
 lib/librte_ethdev/version.map          |  3 +++
 5 files changed, 83 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 638f98168b..6de0cb568e 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: added new API for PMD power management**
+
+  * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
+    ``rte_power_monitor()`` to enable automatic power management for PMD's.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 17ddacc78d..e19dbd838b 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (pmc == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n",
+				pmc);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id],
+			pmc));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index f5f8919186..ca0f91312e 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -4334,6 +4335,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Retrieve the monitor condition for a given receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param pmc
+ *   Pointer to the power-optimized monitoring condition structure.
+ *
+ * @return
+ *   - 0: Success.
+ *   -ENOTSUP: Operation not supported.
+ *   -EINVAL: Invalid parameters.
+ *   -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index 0eacfd8425..3b3b0ec1a0 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an Rx queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param pmc
+ *   The pointer to power-optimized monitoring condition structure.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_monitor_addr_t)(void *rxq,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -917,6 +937,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_monitor_addr_t get_monitor_addr;
+	/**< Get power monitoring condition for Rx queue. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index d3f5410806..a124e1e370 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -240,6 +240,9 @@ EXPERIMENTAL {
 	rte_flow_get_restore_info;
 	rte_flow_tunnel_action_decap_release;
 	rte_flow_tunnel_item_release;
+
+	# added in 21.02
+	rte_eth_get_monitor_addr;
 };
 
 INTERNAL {
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v13 07/11] power: add PMD power management API and callback
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (5 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 08/11] net/ixgbe: implement power management API Anatoly Burakov
                           ` (4 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas,
	konstantin.ananyev, gage.eads, timothy.mcdaniel,
	bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait until
either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling
   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v13:
    - Rework the synchronization mechanism to not require locking
    - Add more parameter checking
    - Rework n_rx_queues access to not go through internal PMD structures and use
      public API instead
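
The empty poll accounting shared by all three callbacks can be sketched in
isolation as below. This is a simplified model, not the code from the patch:
struct poll_model and poll_model_update() are hypothetical names, and the
actual power-saving action (monitor, pause or frequency scaling) is reduced
to a boolean return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold as in the patch */

/* Hypothetical stand-in for struct pmd_queue_cfg; only the counter is kept. */
struct poll_model {
	uint64_t empty_polls; /* consecutive empty polls seen so far */
};

/*
 * Feed the result of one RX burst into the model.
 * Returns true if the callback would attempt a power-saving action.
 */
static bool
poll_model_update(struct poll_model *m, uint16_t nb_rx)
{
	assert(m != NULL);

	if (nb_rx != 0) {
		/* traffic arrived: start counting from scratch */
		m->empty_polls = 0;
		return false;
	}
	m->empty_polls++;
	return m->empty_polls > EMPTYPOLL_MAX;
}
```

Note that the counter is not reset after a power-saving action, so once the
threshold has been exceeded, every further empty poll triggers the action
again until a non-empty burst is received.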

 doc/guides/prog_guide/power_man.rst    |  44 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 360 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 512 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..02280dd689 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API
+provides a convenient alternative by utilizing Ethernet PMD RX callbacks, and
+triggering power saving whenever the empty poll count reaches a certain number.
+
+  * Monitor
+
+    This power saving scheme will put the CPU into an optimized power state and
+    use the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+    descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+    This power saving scheme will avoid busy polling by either entering a
+    power-optimized sleep state with the ``rte_power_pause()`` function, or, if
+    it's not available, using ``rte_pause()``.
+
+  * Frequency scaling
+
+    This power saving scheme will use existing ``librte_power`` library
+    functionality to scale the core frequency up/down depending on traffic
+    volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable a specific power scheme for a given queue/port/core.
+
+* **Queue Disable**: Disable the power scheme for a given queue/port/core.
+
 References
 ----------
 
@@ -200,3 +241,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 6de0cb568e..b34828cad6 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -60,6 +60,16 @@ New Features
   * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
     ``rte_power_monitor()`` to enable automatic power management for PMD's.
 
+* **Add PMD power management helper API**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..51a471b669 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..65597d354c
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,360 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+static struct pmd_conf_data {
+	struct rte_cpu_intrinsics intrinsics_support;
+	/**< what do we support? */
+	uint64_t tsc_per_us;
+	/**< pre-calculated tsc diff for 1us */
+	uint64_t pause_per_us;
+	/**< how many rte_pause can we fit in a microsecond? */
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+	/** Device power management status is about to change. */
+	PMD_MGMT_BUSY
+};
+
+struct pmd_queue_cfg {
+	volatile enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	volatile bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us */
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			int ret;
+
+			/*
+			 * we might get a cancellation request while being
+			 * inside the callback, in which case the wakeup
+			 * wouldn't work because it would've arrived too early.
+			 *
+			 * to get around this, we notify the other thread that
+			 * we're sleeping, so that it can spin until we're done.
+			 * unsolicited wakeups are perfectly safe.
+			 */
+			q_conf->umwait_in_progress = true;
+
+			/* check if we need to cancel sleep */
+			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+				/* use monitoring condition to sleep */
+				ret = rte_eth_get_monitor_addr(port_id, qidx,
+						&pmc);
+				if (ret == 0)
+					rte_power_monitor(&pmc, -1ULL);
+			}
+			q_conf->umwait_in_progress = false;
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	struct rte_eth_dev_info info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_eth_dev_info_get(port_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if queue id is valid */
+	if (queue_id >= info.nb_rx_queues) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we're about to change our state */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		struct rte_power_monitor_cond dummy;
+
+		/* check if rte_power_monitor is supported */
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->umwait_in_progress = false;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto rollback;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+	return ret;
+
+rollback:
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* let the callback know we're shutting down */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..0bfbc6ba69
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..61996b4d11 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v13 08/11] net/ixgbe: implement power management API
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (6 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 07/11] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 09/11] net/i40e: " Anatoly Burakov
                           ` (3 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev,
	gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 9a47a8b262..4b7a5ca60b 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_monitor_addr     = ixgbe_get_monitor_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 6cfbb582e2..7e046a1819 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int
+ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	pmc->data_sz = sizeof(uint32_t);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..8a25e98df6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.25.1


* [dpdk-dev] [PATCH v13 09/11] net/i40e: implement power management API
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (7 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 08/11] net/ixgbe: implement power management API Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 10/11] net/ice: " Anatoly Burakov
                           ` (2 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev,
	gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.
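
As a rough illustration of how a caller might use the returned condition, the
sketch below checks whether the watched bits already hold the expected value
before going to sleep. The struct is a simplified stand-in for
`struct rte_power_monitor_cond`, and `read_sized()` and `cond_already_met()`
are hypothetical helpers, not DPDK functions; that the caller skips sleeping
when the condition already holds is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct rte_power_monitor_cond (illustrative). */
struct monitor_cond {
	volatile void *addr; /* address to watch */
	uint64_t val;        /* expected value of the masked bits */
	uint64_t mask;       /* bits of interest */
	uint8_t data_sz;     /* width of the read: 1, 2, 4 or 8 bytes */
};

/* Hypothetical helper: read a value of the given width from memory. */
static uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case 1: return *(const volatile uint8_t *)p;
	case 2: return *(const volatile uint16_t *)p;
	case 4: return *(const volatile uint32_t *)p;
	case 8: return *(const volatile uint64_t *)p;
	default: return 0; /* unsupported width: treat as "nothing set" */
	}
}

/* Returns true if the masked value already equals the expected value. */
static bool
cond_already_met(const struct monitor_cond *c)
{
	uint64_t cur = read_sized(c->addr, c->data_sz);

	return (cur & c->mask) == c->val;
}
```

With `val` and `mask` both set to the DD bit, as in this patch, the check
returns true once the NIC has written the descriptor back.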

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index f54769c29d..af2577a140 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_monitor_addr             = i40e_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..0b4220fc9c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,31 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	pmc->data_sz = sizeof(uint64_t);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..e1494525ce 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.25.1


* [dpdk-dev] [PATCH v13 10/11] net/ice: implement power management API
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (8 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 09/11] net/i40e: " Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev,
	gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  1 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 9a5d6a559f..c21682c120 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_monitor_addr             = ice_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 5fbd68eafc..fa9e9a235b 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int
+ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	pmc->data_sz = sizeof(uint16_t);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6b16716063..906fbefdc4 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts,
 				  uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.25.1


* [dpdk-dev] [PATCH v13 11/11] examples/l3fwd-power: enable PMD power mgmt
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (9 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 10/11] net/ice: " Anatoly Burakov
@ 2021-01-08 17:42         ` Anatoly Burakov
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev, gage.eads,
	timothy.mcdaniel, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v12:
    - Allow selecting PMD power management scheme from command-line
    - Enforce 1 core 1 queue rule
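
    For readers skimming the notes, selecting a scheme by name can be sketched
    as a small table lookup. This is only an illustration: the enum and
    `parse_scheme()` are hypothetical names, and the patch itself uses
    `enum rte_power_pmd_mgmt_type` with a chain of `strncmp()` calls.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative enum; the patch uses enum rte_power_pmd_mgmt_type. */
enum pmd_mgmt_scheme {
	SCHEME_MONITOR,
	SCHEME_PAUSE,
	SCHEME_SCALE
};

/*
 * Hypothetical helper: map a scheme name to its enum value.
 * Returns 0 on success and -1 if the name is NULL or unknown.
 */
static int
parse_scheme(const char *name, enum pmd_mgmt_scheme *out)
{
	static const struct {
		const char *name;
		enum pmd_mgmt_scheme scheme;
	} tbl[] = {
		{ "monitor", SCHEME_MONITOR },
		{ "pause",   SCHEME_PAUSE },
		{ "scale",   SCHEME_SCALE },
	};
	size_t i;

	if (name == NULL)
		return -1;

	for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
		if (strcmp(name, tbl[i].name) == 0) {
			*out = tbl[i].scheme;
			return 0;
		}
	}
	return -1;
}
```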

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 89 ++++++++++++++++++-
 2 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables the power saving scheme on a specific
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt=monitor -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..e312b6f355 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2824,6 +2895,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.25.1


* Re: [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in power intrinsics
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov
@ 2021-01-08 19:58           ` Stephen Hemminger
  2021-01-11 10:21             ` Burakov, Anatoly
  2021-01-12 15:56           ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2021-01-08 19:58 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Bruce Richardson, Konstantin Ananyev, thomas, gage.eads,
	timothy.mcdaniel, david.hunt, chris.macnamara

On Fri,  8 Jan 2021 17:42:05 +0000
Anatoly Burakov <anatoly.burakov@intel.com> wrote:

> +static uint8_t wait_supported;

Since it is being used as a flag, bool is more common usage.


* Re: [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-09  8:04           ` Andrew Rybchenko
  0 siblings, 0 replies; 421+ messages in thread
From: Andrew Rybchenko @ 2021-01-09  8:04 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Ray Kinsella,
	Neil Horman, konstantin.ananyev, gage.eads, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

On 1/8/21 8:42 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple API to allow getting the monitor conditions for
> power-optimized monitoring of the RX queues from the PMD, as well as
> release notes information.

RX -> Rx

> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>



* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management
  2021-01-08 16:42         ` Burakov, Anatoly
@ 2021-01-11  8:44           ` David Marchand
  2021-01-11  8:52             ` David Marchand
  0 siblings, 1 reply; 421+ messages in thread
From: David Marchand @ 2021-01-11  8:44 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads,
	Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara,
	Ray Kinsella, Yigit, Ferruh

On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 17-Dec-20 4:12 PM, David Marchand wrote:
> > On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov
> > <anatoly.burakov@intel.com> wrote:
> >>
> >> This patchset proposes a simple API for Ethernet drivers to cause the
> >> CPU to enter a power-optimized state while waiting for packets to
> >> arrive. This is achieved through cooperation with the NIC driver that
> >> will allow us to know address of wake up event, and wait for writes on
> >> it.
> >>
> >> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
> >> are used in their raw opcode form because there is no widespread
> >> compiler support for them yet. Still, the API is made generic enough to
> >> hopefully support other architectures, if they happen to implement
> >> similar instructions.
> >>
> >> To achieve power savings, there is a very simple mechanism used: we're
> >> counting empty polls, and if a certain threshold is reached, we get the
> >> address of next RX ring descriptor from the NIC driver, arm the
> >> monitoring hardware, and enter a power-optimized state. We will then
> >> wake up when either a timeout happens, or a write happens (or generally
> >> whenever CPU feels like waking up - this is platform-specific), and
> >> proceed as normal. The empty poll counter is reset whenever we actually
> >> get packets, so we only go to sleep when we know nothing is going on.
> >> The mechanism is generic which can be used for any write back
> >> descriptor.
> >>
> >> This patchset also introduces a few changes into existing power
> >> management-related intrinsics, namely to provide a native way of waking
> >> up a sleeping core without application being responsible for it, as well
> >> as general robustness improvements. There's quite a bit of locking going
> >> on, but these locks are per-thread and very little (if any) contention
> >> is expected, so the performance impact shouldn't be that bad (and in any
> >> case the locking happens when we're about to sleep anyway, not on a
> >> hotpath).
> >>
> >> Why are we putting it into ethdev as opposed to leaving this up to the
> >> application? Our customers specifically requested a way to do it with
> >> minimal changes to the application code. The current approach allows to
> >> just flip a switch and automatically have power savings.
> >>
> >> - Only 1:1 core to queue mapping is supported, meaning that each lcore
> >>    must at most handle RX on a single queue
> >> - Support 3 type policies. Monitor/Pause/Frequency Scaling
> >> - Power management is enabled per-queue
> >> - The API doesn't extend to other device types
> >
> > Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this:
> > https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082
> >
> > We will have to put an exception on driver only ABI.
> >
> >
>
> Why does aarch64 build fail there? The functions in question are in the
> version map file, but the build complains that they aren't.

From what I can see, this series puts rte_power_* symbols in a .h.
So it will be seen as symbols exported by any library including such a header.

The check then complains about this as it sees exported symbols
unknown of the library version.map.


-- 
David Marchand



* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management
  2021-01-11  8:44           ` David Marchand
@ 2021-01-11  8:52             ` David Marchand
  2021-01-11 10:21               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: David Marchand @ 2021-01-11  8:52 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads,
	Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara,
	Ray Kinsella, Yigit, Ferruh

On Mon, Jan 11, 2021 at 9:44 AM David Marchand
<david.marchand@redhat.com> wrote:
>
> On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
> > Why does aarch64 build fail there? The functions in question are in the
> > version map file, but the build complains that they aren't.
>
> From what I can see, this series puts rte_power_* symbols in a .h.
> So it will be seen as symbols exported by any library including such a header.
>
> The check then complains about this as it sees exported symbols
> unknown of the library version.map.

Quick fix:

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h
b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 39e49cc45b..9e498e9ebf 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -13,35 +13,6 @@ extern "C" {

 #include "generic/rte_power_intrinsics.h"

-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_monitor(const struct rte_power_monitor_cond *pmc,
-               const uint64_t tsc_timestamp)
-{
-       RTE_SET_USED(pmc);
-       RTE_SET_USED(tsc_timestamp);
-}
-
-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-       RTE_SET_USED(tsc_timestamp);
-}
-
-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_monitor_wakeup(const unsigned int lcore_id)
-{
-       RTE_SET_USED(lcore_id);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build
index d62875ebae..6ec53ea03a 100644
--- a/lib/librte_eal/arm/meson.build
+++ b/lib/librte_eal/arm/meson.build
@@ -7,4 +7,5 @@ sources += files(
        'rte_cpuflags.c',
        'rte_cycles.c',
        'rte_hypervisor.c',
+       'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c
b/lib/librte_eal/arm/rte_power_intrinsics.c
new file mode 100644
index 0000000000..998f9898ad
--- /dev/null
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -0,0 +1,31 @@
+#include <rte_common.h>
+#include <rte_power_intrinsics.h>
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+               const uint64_t tsc_timestamp)
+{
+       RTE_SET_USED(pmc);
+       RTE_SET_USED(tsc_timestamp);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+       RTE_SET_USED(tsc_timestamp);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+       RTE_SET_USED(lcore_id);
+}


HTH.


-- 
David Marchand



* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management
  2021-01-11  8:52             ` David Marchand
@ 2021-01-11 10:21               ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-11 10:21 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads,
	Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara,
	Ray Kinsella, Yigit, Ferruh

On 11-Jan-21 8:52 AM, David Marchand wrote:
> On Mon, Jan 11, 2021 at 9:44 AM David Marchand
> <david.marchand@redhat.com> wrote:
>>
>> On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly
>> <anatoly.burakov@intel.com> wrote:
>>> Why does aarch64 build fail there? The functions in question are in the
>>> version map file, but the build complains that they aren't.
>>
>>  From what I can see, this series puts rte_power_* symbols in a .h.
>> So it will be seen as symbols exported by any library including such a header.
>>
>> The check then complains about this as it sees exported symbols
>> unknown of the library version.map.
> 
> Quick fix:
> 
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h
> b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> index 39e49cc45b..9e498e9ebf 100644
> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -13,35 +13,6 @@ extern "C" {
> 
>   #include "generic/rte_power_intrinsics.h"
> 
> -/**
> - * This function is not supported on ARM.
> - */
> -void
> -rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> -               const uint64_t tsc_timestamp)
> -{
> -       RTE_SET_USED(pmc);
> -       RTE_SET_USED(tsc_timestamp);
> -}
> -
> -/**
> - * This function is not supported on ARM.
> - */
> -void
> -rte_power_pause(const uint64_t tsc_timestamp)
> -{
> -       RTE_SET_USED(tsc_timestamp);
> -}
> -
> -/**
> - * This function is not supported on ARM.
> - */
> -void
> -rte_power_monitor_wakeup(const unsigned int lcore_id)
> -{
> -       RTE_SET_USED(lcore_id);
> -}
> -
>   #ifdef __cplusplus
>   }
>   #endif
> diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build
> index d62875ebae..6ec53ea03a 100644
> --- a/lib/librte_eal/arm/meson.build
> +++ b/lib/librte_eal/arm/meson.build
> @@ -7,4 +7,5 @@ sources += files(
>          'rte_cpuflags.c',
>          'rte_cycles.c',
>          'rte_hypervisor.c',
> +       'rte_power_intrinsics.c',
>   )
> diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c
> b/lib/librte_eal/arm/rte_power_intrinsics.c
> new file mode 100644
> index 0000000000..998f9898ad
> --- /dev/null
> +++ b/lib/librte_eal/arm/rte_power_intrinsics.c
> @@ -0,0 +1,31 @@
> +#include <rte_common.h>
> +#include <rte_power_intrinsics.h>
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +void
> +rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> +               const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(pmc);
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +void
> +rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +void
> +rte_power_monitor_wakeup(const unsigned int lcore_id)
> +{
> +       RTE_SET_USED(lcore_id);
> +}
> 
> 
> HTH.
> 
> 

OK, will add this to v14. Thanks!

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in power intrinsics
  2021-01-08 19:58           ` Stephen Hemminger
@ 2021-01-11 10:21             ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-11 10:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, Bruce Richardson, Konstantin Ananyev, thomas, gage.eads,
	timothy.mcdaniel, david.hunt, chris.macnamara

On 08-Jan-21 7:58 PM, Stephen Hemminger wrote:
> On Fri,  8 Jan 2021 17:42:05 +0000
> Anatoly Burakov <anatoly.burakov@intel.com> wrote:
> 
>> +static uint8_t wait_supported;
> 
> Since it is being used as a flag, bool is more common usage.
> 

Will fix in v14, thanks!

-- 
Thanks,
Anatoly


* [dpdk-dev] [PATCH v14 00/11] Add PMD power management
  2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
                           ` (10 preceding siblings ...)
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-11 14:35         ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics Anatoly Burakov
                             ` (11 more replies)
  11 siblings, 12 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt,
	bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver that will allow us to know the address of the wake-up
event, and wait for writes on that address.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They 
are used in their raw opcode form because there is no widespread 
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement 
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're 
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
a Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.
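
The empty-poll mechanism described above can be sketched roughly as follows.
This is only an illustration of the counting logic: `EMPTY_POLL_THRESHOLD`,
`struct queue_state` and `should_save_power()` are hypothetical names, and the
threshold value is an arbitrary assumption rather than the one used by the
patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the real value is an implementation detail. */
#define EMPTY_POLL_THRESHOLD 512

struct queue_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Called after every Rx burst with the number of packets received.
 * Returns true when the caller should apply its power-saving scheme.
 */
static bool
should_save_power(struct queue_state *qs, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		qs->empty_polls = 0; /* traffic seen: reset the counter */
		return false;
	}

	qs->empty_polls++;
	return qs->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

Because the counter is reset on any non-empty burst, a queue carrying even
light traffic never crosses the threshold.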

This patchset also introduces a few changes into existing power 
management-related intrinsics, namely to provide a native way of waking 
up a sleeping core without application being responsible for it, as well 
as general robustness improvements. There's quite a bit of locking going 
on, but these locks are per-thread and very little (if any) contention 
is expected, so the performance impact shouldn't be that bad (and in any 
case the locking happens when we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the 
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows one to
just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle RX on at most a single queue
- Three policy types are supported: monitor, pause, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  44 +++
 doc/guides/rel_notes/release_21_02.rst        |  15 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |  10 +-
 drivers/event/dlb2/dlb2.c                     |  10 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  25 ++
 drivers/net/i40e/i40e_rxtx.h                  |   1 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   1 +
 examples/l3fwd-power/main.c                   |  89 ++++-
 .../arm/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  34 ++
 .../include/generic/rte_power_intrinsics.h    |  78 ++--
 .../ppc/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  34 ++
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 ------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 184 +++++++++
 lib/librte_ethdev/rte_ethdev.c                |  28 ++
 lib/librte_ethdev/rte_ethdev.h                |  25 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  22 ++
 lib/librte_ethdev/version.map                 |   3 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 360 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  90 +++++
 lib/librte_power/version.map                  |   5 +
 34 files changed, 1097 insertions(+), 259 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 02/11] eal: avoid invalid API usage in " Anatoly Burakov
                             ` (10 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, power intrinsics are inline functions. Make them part of the
ABI so that we can have various internal data associated with them
without exposing said data to the outside world.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v14:
    - Fix compile issues on ARM and PPC64 by moving implementations to .c files

 .../arm/include/rte_power_intrinsics.h        |  40 ------
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  42 ++++++
 .../include/generic/rte_power_intrinsics.h    |   6 +-
 .../ppc/include/rte_power_intrinsics.h        |  40 ------
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  42 ++++++
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 -----------------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 120 ++++++++++++++++++
 11 files changed, 215 insertions(+), 198 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index a4a1bc1159..9e498e9ebf 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -13,46 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	RTE_SET_USED(tsc_timestamp);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build
index d62875ebae..6ec53ea03a 100644
--- a/lib/librte_eal/arm/meson.build
+++ b/lib/librte_eal/arm/meson.build
@@ -7,4 +7,5 @@ sources += files(
 	'rte_cpuflags.c',
 	'rte_cycles.c',
 	'rte_hypervisor.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
new file mode 100644
index 0000000000..e5a49facb4
--- /dev/null
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+void rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index dd520d90fa..67977bd511 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -52,7 +52,7 @@
  *   to undefined result.
  */
 __rte_experimental
-static inline void rte_power_monitor(const volatile void *p,
+void rte_power_monitor(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz);
 
@@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p,
  *   wakes up.
  */
 __rte_experimental
-static inline void rte_power_monitor_sync(const volatile void *p,
+void rte_power_monitor_sync(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz,
 		rte_spinlock_t *lck);
@@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  *   architecture-dependent.
  */
 __rte_experimental
-static inline void rte_power_pause(const uint64_t tsc_timestamp);
+void rte_power_pause(const uint64_t tsc_timestamp);
 
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 4ed03d521f..c0e9ac279f 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -13,46 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	RTE_SET_USED(tsc_timestamp);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/ppc/meson.build b/lib/librte_eal/ppc/meson.build
index f4b6d95c42..43c46542fb 100644
--- a/lib/librte_eal/ppc/meson.build
+++ b/lib/librte_eal/ppc/meson.build
@@ -7,4 +7,5 @@ sources += files(
 	'rte_cpuflags.c',
 	'rte_cycles.c',
 	'rte_hypervisor.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
new file mode 100644
index 0000000000..785effabe6
--- /dev/null
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+void rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		       const uint64_t value_mask, const uint64_t tsc_timestamp,
+		       const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+			    const uint64_t value_mask, const uint64_t tsc_timestamp,
+			    const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 354c068f31..31bf76ae81 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -403,6 +403,11 @@ EXPERIMENTAL {
 	rte_service_lcore_may_be_active;
 	rte_vect_get_max_simd_bitwidth;
 	rte_vect_set_max_simd_bitwidth;
+
+	# added in 21.02
+	rte_power_monitor;
+	rte_power_monitor_sync;
+	rte_power_pause;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
index c7d790c854..e4c2b87f73 100644
--- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -13,121 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-static inline uint64_t
-__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz)
-{
-	switch (sz) {
-	case sizeof(uint8_t):
-		return *(const volatile uint8_t *)p;
-	case sizeof(uint16_t):
-		return *(const volatile uint16_t *)p;
-	case sizeof(uint32_t):
-		return *(const volatile uint32_t *)p;
-	case sizeof(uint64_t):
-		return *(const volatile uint64_t *)p;
-	default:
-		/* this is an intrinsic, so we can't have any error handling */
-		RTE_ASSERT(0);
-		return 0;
-	}
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
-/**
- * This function uses TPAUSE instruction  and will enter C0.2 state. For more
- * information about usage of this instruction, please refer to Intel(R) 64 and
- * IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* execute TPAUSE */
-	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-		: /* ignore rflags */
-		: "D"(0), /* enter C0.2 */
-		  "a"(tsc_l), "d"(tsc_h));
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build
index e78f29002e..dfd42dee0c 100644
--- a/lib/librte_eal/x86/meson.build
+++ b/lib/librte_eal/x86/meson.build
@@ -8,4 +8,5 @@ sources += files(
 	'rte_cycles.c',
 	'rte_hypervisor.c',
 	'rte_spinlock.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
new file mode 100644
index 0000000000..34c5fd9c3e
--- /dev/null
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses TPAUSE instruction  and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			"a"(tsc_l), "d"(tsc_h));
+}
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v14 02/11] eal: avoid invalid API usage in power intrinsics
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 03/11] eal: change API of " Anatoly Burakov
                             ` (9 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel,
	david.hunt, chris.macnamara

Currently, the API documentation mandates that if the user wants to use
the power management intrinsics, they need to call the
`rte_cpu_get_intrinsics_support` API and check support for specific
intrinsics.

However, if the user does not do that, it is possible to get an illegal
instruction error because we're using raw instruction opcodes, which may
or may not be supported at runtime.

Now that we have everything in a C file, we can check for support at
startup and prevent the user from possibly encountering illegal
instruction errors.
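
As a rough, self-contained sketch of the idea (not the code in this
patch): a constructor runs before main(), records whether the wait
instructions are available, and every intrinsic returns early when they
are not. The probe cpu_has_wait_instructions() and the counter
sleeps_executed are hypothetical stand-ins, and the constructor
attribute assumes GCC or Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* stays false unless the startup probe reports support */
static bool wait_supported;
/* stands in for actually executing the wait instruction */
static unsigned int sleeps_executed;

/* Hypothetical probe; a real one would query the CPU feature flags. */
static bool
cpu_has_wait_instructions(void)
{
	return false;
}

/* Runs once at load time, before main() (GCC/Clang extension). */
__attribute__((constructor))
static void
power_intrinsics_init(void)
{
	wait_supported = cpu_has_wait_instructions();
}

static void
power_pause_guarded(uint64_t tsc_timestamp)
{
	(void)tsc_timestamp;

	/* never reach the raw opcode on CPUs that lack it */
	if (!wait_supported)
		return;

	sleeps_executed++;
}
```

With the flag checked inside the function, a caller that forgets to
query support gets a no-op rather than an illegal instruction.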

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v14:
    - Replace uint8_t with bool

 .../include/generic/rte_power_intrinsics.h    |  3 --
 lib/librte_eal/x86/rte_power_intrinsics.c     | 31 +++++++++++++++++--
 2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 67977bd511..ffa72f7578 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -34,7 +34,6 @@
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 34c5fd9c3e..050ae612a8 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,8 @@
 
 #include "rte_power_intrinsics.h"
 
+static bool wait_supported;
+
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
 {
@@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp)
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/* execute TPAUSE */
 	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			"a"(tsc_l), "d"(tsc_h));
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+RTE_INIT(rte_power_intrinsics_init) {
+	struct rte_cpu_intrinsics i;
+
+	rte_cpu_get_intrinsics_support(&i);
+
+	if (i.power_monitor && i.power_pause)
+		wait_supported = 1;
 }
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v14 03/11] eal: change API of power intrinsics
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 02/11] eal: avoid invalid API usage in " Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 04/11] eal: remove sync version of power monitor Anatoly Burakov
                             ` (8 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Timothy McDaniel, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Bruce Richardson, Konstantin Ananyev, thomas,
	david.hunt, chris.macnamara

Instead of passing around pointers and integers, collect everything
into a struct. This makes API design around these intrinsics much easier.
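
To illustrate how a caller or helper can work with one condition object
instead of four loose arguments, here is a minimal sketch. The struct
monitor_cond below is a local stand-in that mirrors the fields of
struct rte_power_monitor_cond, and read_sized() and cond_already_met()
are hypothetical helper names, not part of the API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in mirroring the fields of struct rte_power_monitor_cond. */
struct monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* mask applied to the value read from addr */
	uint8_t data_sz;     /* 1, 2, 4 or 8 bytes */
};

/* Read 1, 2, 4 or 8 bytes from p; any other size yields 0 in this sketch. */
static uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		return 0;
	}
}

/* True if the masked value at addr already equals the expected value. */
static bool
cond_already_met(const struct monitor_cond *c)
{
	/* a zero mask means the comparison is skipped */
	if (c->mask == 0)
		return false;
	return (read_sized(c->addr, c->data_sz) & c->mask) == c->val;
}
```

A single pointer to the condition is also easier to hand from a driver
to the power library than a growing list of scalar parameters.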

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/event/dlb/dlb.c                       | 10 ++--
 drivers/event/dlb2/dlb2.c                     | 10 ++--
 lib/librte_eal/arm/rte_power_intrinsics.c     | 25 ++++------
 .../include/generic/rte_power_intrinsics.h    | 49 ++++++++-----------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 25 ++++------
 lib/librte_eal/x86/rte_power_intrinsics.c     | 32 ++++++------
 6 files changed, 70 insertions(+), 81 deletions(-)

diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c
index 0c95c4793d..d2f2026291 100644
--- a/drivers/event/dlb/dlb.c
+++ b/drivers/event/dlb/dlb.c
@@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		/* Interrupts not supported by PF PMD */
 		return 1;
 	} else if (dlb->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 86724863f2..c9a8a02278 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 	if (elapsed_ticks >= timeout) {
 		return 1;
 	} else if (dlb2->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb2_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index e5a49facb4..f2c3506b90 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -7,36 +7,31 @@
 /**
  * This function is not supported on ARM.
  */
-void rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+void
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on ARM.
  */
-void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+void
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on ARM.
  */
-void rte_power_pause(const uint64_t tsc_timestamp)
+void
+rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
 }
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index ffa72f7578..00c670cb50 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -18,6 +18,18 @@
  * which are architecture-dependent.
  */
 
+struct rte_power_monitor_cond {
+	volatile void *addr;  /**< Address to monitor for changes */
+	uint64_t val;         /**< Before attempting the monitoring, the address
+	                       *   may be read and compared against this value.
+	                       **/
+	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
+	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
+	                  *   expected value with the memory address. Can be 1,
+	                  *   2, 4, or 8. Supplying any other value will lead to
+	                  *   undefined result. */
+};
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
@@ -35,25 +47,15 @@
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  */
 __rte_experimental
-void rte_power_monitor(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz);
+void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp);
 
 /**
  * @warning
@@ -75,30 +77,19 @@ void rte_power_monitor(const volatile void *p,
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  * @param lck
  *   A spinlock that must be locked before entering the function, will be
  *   unlocked while the CPU is sleeping, and will be locked again once the CPU
  *   wakes up.
  */
 __rte_experimental
-void rte_power_monitor_sync(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz,
-		rte_spinlock_t *lck);
+void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
 
 /**
  * @warning
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 785effabe6..3897d2024d 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -7,36 +7,31 @@
 /**
  * This function is not supported on PPC64.
  */
-void rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		       const uint64_t value_mask, const uint64_t tsc_timestamp,
-		       const uint8_t data_sz)
+void
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on PPC64.
  */
-void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-			    const uint64_t value_mask, const uint64_t tsc_timestamp,
-			    const uint8_t data_sz, rte_spinlock_t *lck)
+void
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on PPC64.
  */
-void rte_power_pause(const uint64_t tsc_timestamp)
+void
+rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
 }
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 050ae612a8..fd061e8e50 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -31,9 +31,8 @@ __get_umwait_val(const volatile void *p, const uint8_t sz)
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -50,14 +49,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return;
 	}
 	/* execute UMWAIT */
@@ -73,9 +73,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -92,14 +91,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return;
 	}
 	rte_spinlock_unlock(lck);
-- 
2.25.1

* [dpdk-dev] [PATCH v14 04/11] eal: remove sync version of power monitor
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (2 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 05/11] eal: add monitor wakeup function Anatoly Burakov
                             ` (7 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the "sync" version of the power monitor intrinsic is intended
for waking up a sleeping core. However, there are better ways to achieve
the same result, so remove the unneeded function.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/arm/rte_power_intrinsics.c     | 12 -----
 .../include/generic/rte_power_intrinsics.h    | 34 --------------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 12 -----
 lib/librte_eal/version.map                    |  1 -
 lib/librte_eal/x86/rte_power_intrinsics.c     | 46 -------------------
 5 files changed, 105 deletions(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index f2c3506b90..6b8219b919 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -15,18 +15,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	RTE_SET_USED(tsc_timestamp);
 }
 
-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-}
-
 /**
  * This function is not supported on ARM.
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 00c670cb50..a6f1955996 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -57,40 +57,6 @@ __rte_experimental
 void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
-/**
- * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
- *
- * Monitor specific address for changes. This will cause the CPU to enter an
- * architecture-defined optimized power state until either the specified
- * memory address is written to, a certain TSC timestamp is reached, or other
- * reasons cause the CPU to wake up.
- *
- * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
- * mask is non-zero, the current value pointed to by the `p` pointer will be
- * checked against the expected value, and if they match, the entering of
- * optimized power state may be aborted.
- *
- * This call will also lock a spinlock on entering sleep, and release it on
- * waking up the CPU.
- *
- * @warning It is responsibility of the user to check if this function is
- *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *
- * @param pmc
- *   The monitoring condition structure.
- * @param tsc_timestamp
- *   Maximum TSC timestamp to wait for. Note that the wait behavior is
- *   architecture-dependent.
- * @param lck
- *   A spinlock that must be locked before entering the function, will be
- *   unlocked while the CPU is sleeping, and will be locked again once the CPU
- *   wakes up.
- */
-__rte_experimental
-void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
-
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 3897d2024d..9a40c4d5d6 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -15,18 +15,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	RTE_SET_USED(tsc_timestamp);
 }
 
-/**
- * This function is not supported on PPC64.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-}
-
 /**
  * This function is not supported on PPC64.
  */
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 31bf76ae81..20945b1efa 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,7 +406,6 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
-	rte_power_monitor_sync;
 	rte_power_pause;
 };
 
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index fd061e8e50..14c88600f0 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -67,52 +67,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 			  "a"(tsc_l), "d"(tsc_h));
 }
 
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* prevent user from running this instruction if it's not supported */
-	if (!wait_supported)
-		return;
-
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(pmc->addr));
-
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->data_sz);
-		const uint64_t masked = cur_value & pmc->mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
 /**
  * This function uses TPAUSE instruction  and will enter C0.2 state. For more
  * information about usage of this instruction, please refer to Intel(R) 64 and
-- 
2.25.1

* [dpdk-dev] [PATCH v14 05/11] eal: add monitor wakeup function
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (3 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 04/11] eal: remove sync version of power monitor Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 06/11] ethdev: add simple power management API Anatoly Burakov
                             ` (6 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Now that we have everything in a C file, we can store the information
about our sleep, and have a native mechanism to wake up the sleeping
core. This mechanism would however only wake up a core that's sleeping
while monitoring - waking up from `rte_power_pause` won't work.
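
As a rough illustration of the wakeup idea (not the patch itself), the
sketch below issues a store to a monitored location without altering its
contents, by compare-and-swapping the current value with itself. The
helper name `touch_without_changing` is hypothetical, and the GCC/Clang
`__atomic` builtins are assumed to be available.

```c
#include <stdint.h>

/*
 * Hypothetical helper: "trigger a write but don't change the value".
 * The value currently at addr is read, then swapped in again via a
 * compare-and-swap. Returns the value that was read.
 */
static inline uint64_t
touch_without_changing(volatile uint64_t *addr)
{
	/* snapshot the current contents */
	uint64_t val = __atomic_load_n(addr, __ATOMIC_RELAXED);

	/*
	 * Expected and desired values are identical, so a successful
	 * exchange stores the same bits back. If another thread changed
	 * the location in between, the exchange fails and nothing of ours
	 * is stored (val is refreshed with the new contents instead).
	 */
	__atomic_compare_exchange_n(addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
	return val;
}
```

Either outcome leaves the stored data intact, which is why this is safe
to do on a descriptor that the NIC or another thread may be updating.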

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v13:
    - Add comments around wakeup code to explain what it does
    - Add lcore_id parameter checking to prevent buffer overrun

 lib/librte_eal/arm/rte_power_intrinsics.c     |  9 ++
 .../include/generic/rte_power_intrinsics.h    | 16 ++++
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  9 ++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 85 +++++++++++++++++++
 5 files changed, 120 insertions(+)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 6b8219b919..14081a2c5b 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -23,3 +23,12 @@ rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
 }
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index a6f1955996..e311d6f8ea 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -57,6 +57,22 @@ __rte_experimental
 void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Wake up a specific lcore that is in a power optimized state and is monitoring
+ * an address.
+ *
+ * @note This function will *not* wake up a core that is in a power optimized
+ *   state due to calling `rte_power_pause`.
+ *
+ * @param lcore_id
+ *   Lcore ID of a sleeping thread.
+ */
+__rte_experimental
+void rte_power_monitor_wakeup(const unsigned int lcore_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 9a40c4d5d6..a7db61a7c3 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -23,3 +23,12 @@ rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
 }
+
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 20945b1efa..ac026e289d 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,6 +406,7 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
+	rte_power_monitor_wakeup;
 	rte_power_pause;
 };
 
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 14c88600f0..7cbe156199 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -2,8 +2,31 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
+#include <rte_common.h>
+#include <rte_lcore.h>
+#include <rte_spinlock.h>
+
 #include "rte_power_intrinsics.h"
 
+/*
+ * Per-lcore structure holding current status of C0.2 sleeps.
+ */
+static struct power_wait_status {
+	rte_spinlock_t lock;
+	volatile void *monitor_addr; /**< NULL if not currently sleeping */
+} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+
+static inline void
+__umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	/* trigger a write but don't change the value */
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
 static bool wait_supported;
 
 static inline uint64_t
@@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s;
+
+	/* prevent non-EAL thread from using this API */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
@@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		if (masked == pmc->val)
 			return;
 	}
+
+	s = &wait_status[lcore_id];
+
+	/* update sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = pmc->addr;
+	rte_spinlock_unlock(&s->lock);
+
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
+
+	/* erase sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = NULL;
+	rte_spinlock_unlock(&s->lock);
 }
 
 /**
@@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) {
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
 }
+
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	struct power_wait_status *s;
+
+	/* prevent buffer overrun */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
+	s = &wait_status[lcore_id];
+
+	/*
+	 * There is a race condition between sleep, wakeup and locking, but we
+	 * don't need to handle it.
+	 *
+	 * Possible situations:
+	 *
+	 * 1. T1 locks, sets address, unlocks
+	 * 2. T2 locks, triggers wakeup, unlocks
+	 * 3. T1 sleeps
+	 *
+	 * In this case, because T1 has already set the address for monitoring,
+	 * we will wake up immediately even if T2 triggers wakeup before T1
+	 * goes to sleep.
+	 *
+	 * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
+	 * 2. T2 locks, triggers wakeup, and unlocks
+	 * 3. T1 locks, erases address, and unlocks
+	 *
+	 * In this case, since we've already woken up, the "wakeup" was
+	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
+	 * wakeup address is still valid so it's perfectly safe to write it.
+	 */
+	rte_spinlock_lock(&s->lock);
+	if (s->monitor_addr != NULL)
+		__umwait_wakeup(s->monitor_addr);
+	rte_spinlock_unlock(&s->lock);
+}
-- 
2.25.1

* [dpdk-dev] [PATCH v14 06/11] ethdev: add simple power management API
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (4 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 05/11] eal: add monitor wakeup function Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 07/11] power: add PMD power management API and callback Anatoly Burakov
                             ` (5 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, konstantin.ananyev, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to allow getting the monitor conditions for
power-optimized monitoring of the Rx queues from the PMD, as well as
release notes information.
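
For readers unfamiliar with the monitor condition, here is a minimal
standalone sketch of how such a condition might be evaluated before going
to sleep. `struct monitor_cond` is a local stand-in that mirrors the
fields of `struct rte_power_monitor_cond`; `read_sized` and
`cond_already_met` are hypothetical helpers, not part of DPDK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in mirroring the fields of struct rte_power_monitor_cond. */
struct monitor_cond {
	volatile void *addr; /* address to monitor for changes */
	uint64_t val;        /* expected value */
	uint64_t mask;       /* mask applied to the value read from addr */
	uint8_t data_sz;     /* size of the read: 1, 2, 4 or 8 bytes */
};

/* Hypothetical helper: read sz bytes from p, widened to 64 bits. */
static uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		/* the real API documents other sizes as undefined */
		return 0;
	}
}

/* True when the sleep should be skipped because the value already matches. */
static bool
cond_already_met(const struct monitor_cond *c)
{
	if (c->mask == 0)
		return false; /* a zero mask disables the check */
	return (read_sized(c->addr, c->data_sz) & c->mask) == c->val;
}
```

A driver would fill in the address of its next Rx descriptor along with
a value/mask pair that identifies a completed descriptor; the caller then
skips the sleep when the check above already holds.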

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---

Notes:
    v13:
    - Fix typos and issues raised by Andrew

 doc/guides/rel_notes/release_21_02.rst |  5 +++++
 lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
 lib/librte_ethdev/version.map          |  3 +++
 5 files changed, 83 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 638f98168b..6de0cb568e 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: added new API for PMD power management**
+
+  * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
+    ``rte_power_monitor()`` to enable automatic power management for PMD's.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 17ddacc78d..e19dbd838b 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (pmc == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n",
+				pmc);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id],
+			pmc));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index f5f8919186..ca0f91312e 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -4334,6 +4335,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Retrieve the monitor condition for a given receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param pmc
+ *   Pointer to the power-optimized monitoring condition structure.
+ *
+ * @return
+ *   - 0: Success.
+ *   -ENOTSUP: Operation not supported.
+ *   -EINVAL: Invalid parameters.
+ *   -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index 0eacfd8425..3b3b0ec1a0 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an Rx queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param pmc
+ *   The pointer to power-optimized monitoring condition structure.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_monitor_addr_t)(void *rxq,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -917,6 +937,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_monitor_addr_t get_monitor_addr;
+	/**< Get power monitoring condition for Rx queue. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index d3f5410806..a124e1e370 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -240,6 +240,9 @@ EXPERIMENTAL {
 	rte_flow_get_restore_info;
 	rte_flow_tunnel_action_decap_release;
 	rte_flow_tunnel_item_release;
+
+	# added in 21.02
+	rte_eth_get_monitor_addr;
 };
 
 INTERNAL {
-- 
2.25.1

* [dpdk-dev] [PATCH v14 07/11] power: add PMD power management API and callback
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (5 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 08/11] net/ixgbe: implement power management API Anatoly Burakov
                             ` (4 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas,
	konstantin.ananyev, timothy.mcdaniel, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design uses PMD RX callbacks. Three power saving schemes are supported:

1. UMWAIT/UMONITOR

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling

   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.
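
The empty poll counting common to all three schemes can be pictured with
the standalone sketch below. It is only an illustration: `EMPTYPOLL_MAX`,
`struct queue_state` and `should_sleep` are hypothetical names, and the
threshold value is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the actual value is an implementation choice. */
#define EMPTYPOLL_MAX 512

/* Hypothetical per-queue bookkeeping. */
struct queue_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Called after every Rx burst with the number of packets received.
 * Returns true once the count of consecutive empty polls reaches the
 * threshold, i.e. when the caller may enter a power-optimized state.
 */
static bool
should_sleep(struct queue_state *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		q->empty_polls = 0;
		return false;
	}
	q->empty_polls++;
	return q->empty_polls >= EMPTYPOLL_MAX;
}
```

In an Rx callback, a true return would be the point at which the chosen
scheme kicks in (monitor, pause, or scaling the frequency down).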

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v13:
    - Rework the synchronization mechanism to not require locking
    - Add more parameter checking
    - Rework n_rx_queues access to not go through internal PMD structures and use
      public API instead

 doc/guides/prog_guide/power_man.rst    |  44 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 360 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 512 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..02280dd689 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+  * Monitor
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not available, use ``rte_pause()``.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing ``librte_power`` library
+   functionality to scale the core frequency up/down depending on traffic
+   volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core
+
 References
 ----------
 
@@ -200,3 +241,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 6de0cb568e..b34828cad6 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -60,6 +60,16 @@ New Features
   * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
     ``rte_power_monitor()`` to enable automatic power management for PMD's.
 
+* **Add PMD power management helper API**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..51a471b669 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..65597d354c
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,360 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+static struct pmd_conf_data {
+	struct rte_cpu_intrinsics intrinsics_support;
+	/**< what do we support? */
+	uint64_t tsc_per_us;
+	/**< pre-calculated tsc diff for 1us */
+	uint64_t pause_per_us;
+	/**< how many rte_pause can we fit in a microsecond? */
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+	/** Device power management status is about to change. */
+	PMD_MGMT_BUSY
+};
+
+struct pmd_queue_cfg {
+	volatile enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	volatile bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us */
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			int ret;
+
+			/*
+			 * we might get a cancellation request while being
+			 * inside the callback, in which case the wakeup
+			 * wouldn't work because it would've arrived too early.
+			 *
+			 * to get around this, we notify the other thread that
+			 * we're sleeping, so that it can spin until we're done.
+			 * unsolicited wakeups are perfectly safe.
+			 */
+			q_conf->umwait_in_progress = true;
+
+			/* sleep only if power management is still enabled */
+			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+				/* use monitoring condition to sleep */
+				ret = rte_eth_get_monitor_addr(port_id, qidx,
+						&pmc);
+				if (ret == 0)
+					rte_power_monitor(&pmc, -1ULL);
+			}
+			q_conf->umwait_in_progress = false;
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	struct rte_eth_dev_info info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_eth_dev_info_get(port_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if queue id is valid */
+	if (queue_id >= info.nb_rx_queues) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we're about to change our state */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		struct rte_power_monitor_cond dummy;
+
+		/* check if rte_power_monitor is supported */
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->umwait_in_progress = false;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto rollback;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+	return ret;
+
+rollback:
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* let the callback know we're shutting down */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..0bfbc6ba69
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the RX queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the RX queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..61996b4d11 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.25.1
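
The empty-poll accounting shared by the three RX callbacks in the patch above
can be sketched in isolation. This is a minimal illustration only, not the
library code: `struct queue_state` and `should_sleep()` are hypothetical names,
and only the threshold value is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold value as in the patch */

/* Hypothetical stand-in for the per-queue state kept by the patch. */
struct queue_state {
	uint64_t empty_polls; /* consecutive RX bursts that returned nothing */
};

/*
 * Record the result of one RX burst and report whether the caller should
 * enter a power-saving state before polling again.
 */
static bool
should_sleep(struct queue_state *q, uint16_t nb_rx)
{
	assert(q != NULL);

	if (nb_rx != 0) {
		/* traffic arrived: restart the count */
		q->empty_polls = 0;
		return false;
	}

	q->empty_polls++;
	/* strictly greater-than, as in the callbacks */
	return q->empty_polls > EMPTYPOLL_MAX;
}
```

With this comparison the first EMPTYPOLL_MAX empty polls never trigger a sleep;
the next one does, and any non-empty burst resets the count.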

^ permalink raw reply	[flat|nested] 421+ messages in thread
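
The rte_pause() calibration done by calc_tsc() in patch 07/11 is plain
arithmetic and can be checked on its own. The sketch below repeats the same
steps with the measured values passed in as parameters; `pauses_per_us()` is a
hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define US_PER_S 1000000ULL

/*
 * Estimate how many pause instructions fit into one microsecond, given the
 * timer frequency and the number of timer cycles that n_pauses pauses took.
 */
static uint64_t
pauses_per_us(uint64_t timer_hz, uint64_t elapsed_cycles, uint32_t n_pauses)
{
	assert(timer_hz >= US_PER_S && elapsed_cycles > 0 && n_pauses > 0);

	const uint64_t tsc_per_us = timer_hz / US_PER_S;
	const double us = elapsed_cycles / (double)tsc_per_us;
	const double us_per_pause = us / n_pauses;

	/* truncates toward zero, so a pause longer than 1us yields 0 */
	return (uint64_t)(1.0 / us_per_pause);
}
```

Note that when a single pause takes longer than a microsecond the result
truncates to zero, in which case a loop bounded by this value would not pause
at all.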

* [dpdk-dev] [PATCH v14 08/11] net/ixgbe: implement power management API
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (6 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 07/11] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 09/11] net/i40e: " Anatoly Burakov
                             ` (3 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 9a47a8b262..4b7a5ca60b 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_monitor_addr     = ixgbe_get_monitor_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 6cfbb582e2..7e046a1819 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int
+ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	pmc->data_sz = sizeof(uint32_t);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..8a25e98df6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
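
To illustrate how the fields filled in by a `get_monitor_addr` implementation
are meant to be read, here is a minimal sketch of the "already written" test.
It assumes that the caller skips sleeping when the masked value already equals
`val`; `struct monitor_cond`, `read_field()` and `already_written()` are
hypothetical names, and byte order is ignored for brevity (the drivers
themselves convert with rte_cpu_to_le_*).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fields the drivers fill in. */
struct monitor_cond {
	const volatile void *addr; /* descriptor field to watch */
	uint64_t val;              /* value that means "descriptor is done" */
	uint64_t mask;             /* bits of the field that are compared */
	uint8_t data_sz;           /* width of the field in bytes */
};

/* Read the watched field using the width reported by the driver. */
static uint64_t
read_field(const struct monitor_cond *c)
{
	switch (c->data_sz) {
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)c->addr;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)c->addr;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)c->addr;
	default:
		assert(0 && "unsupported field width");
		return 0;
	}
}

/* True if the descriptor was already written, so sleeping is pointless. */
static bool
already_written(const struct monitor_cond *c)
{
	return (read_field(c) & c->mask) == c->val;
}
```

The three drivers in this series differ only in which descriptor field they
point at and in its width (32-bit for ixgbe, 64-bit for i40e, 16-bit for ice).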

* [dpdk-dev] [PATCH v14 09/11] net/i40e: implement power management API
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (7 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 08/11] net/ixgbe: implement power management API Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 10/11] net/ice: " Anatoly Burakov
                             ` (2 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index f54769c29d..af2577a140 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_monitor_addr             = i40e_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..0b4220fc9c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,31 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	pmc->data_sz = sizeof(uint64_t);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..e1494525ce 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v14 10/11] net/ice: implement power management API
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (8 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 09/11] net/i40e: " Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  1 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 9a5d6a559f..c21682c120 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_monitor_addr             = ice_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 5fbd68eafc..fa9e9a235b 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int
+ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	pmc->data_sz = sizeof(uint16_t);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6b16716063..906fbefdc4 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts,
 				  uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v14 11/11] examples/l3fwd-power: enable PMD power mgmt
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (9 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 10/11] net/ice: " Anatoly Burakov
@ 2021-01-11 14:35           ` Anatoly Burakov
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev,
	timothy.mcdaniel, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v12:
    - Allow selecting PMD power management scheme from command-line
    - Enforce 1 core 1 queue rule

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 89 ++++++++++++++++++-
 2 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables the power saving scheme on a specific
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt=monitor -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..e312b6f355 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2824,6 +2895,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 00/11] Add PMD power management
  2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
                             ` (10 preceding siblings ...)
  2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-11 14:58           ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov
                               ` (11 more replies)
  11 siblings, 12 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt,
	bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The last of these is achieved through
cooperation with the NIC driver, which lets us know the address of the
wake-up event and wait for writes to that address.

On IA, this is achieved by using the UMONITOR/UMWAIT instructions. They
are used in their raw opcode form because there is no widespread
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're 
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
a Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.
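
The empty-poll logic described above can be sketched roughly as follows.
This is only an illustration, not the code from the patchset: the
threshold value, the struct and the function name are all made up here,
and the real callback lives inside librte_power.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the cover letter does not state a value. */
#define EMPTY_POLL_THRESHOLD 512

/* Hypothetical per-queue state, not the layout used by librte_power. */
struct queue_poll_state {
	uint32_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Update the counter after one Rx burst and report whether the caller
 * should apply its power-saving scheme (monitor, pause or scale).
 */
static bool
should_save_power(struct queue_poll_state *st, uint16_t nb_rx)
{
	if (nb_rx > 0) {
		/* traffic seen: reset the counter and keep polling */
		st->empty_polls = 0;
		return false;
	}
	/* saturate at the threshold so the counter cannot wrap */
	if (st->empty_polls < EMPTY_POLL_THRESHOLD)
		st->empty_polls++;
	return st->empty_polls >= EMPTY_POLL_THRESHOLD;
}
```

An Rx callback would call something like should_save_power() with the
burst size it just observed and only enter the power-optimized state
when it returns true, so a single empty poll never puts the core to
sleep.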

This patchset also introduces a few changes into existing power
management-related intrinsics, namely to provide a native way of waking
up a sleeping core without the application being responsible for it, as
well as general robustness improvements. There's quite a bit of locking
going on, but these locks are per-thread and very little (if any)
contention is expected, so the performance impact should be small (and
in any case the locking happens when we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach lets the
user just flip a switch and get power savings automatically.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle RX on at most a single queue
- Three policies are supported: monitor, pause, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  44 +++
 doc/guides/rel_notes/release_21_02.rst        |  15 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |  10 +-
 drivers/event/dlb2/dlb2.c                     |  10 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  25 ++
 drivers/net/i40e/i40e_rxtx.h                  |   1 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   1 +
 examples/l3fwd-power/main.c                   |  89 ++++-
 .../arm/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  34 ++
 .../include/generic/rte_power_intrinsics.h    |  78 ++--
 .../ppc/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  34 ++
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 ------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 184 +++++++++
 lib/librte_ethdev/rte_ethdev.c                |  28 ++
 lib/librte_ethdev/rte_ethdev.h                |  25 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  22 ++
 lib/librte_ethdev/version.map                 |   3 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 359 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  90 +++++
 lib/librte_power/version.map                  |   5 +
 34 files changed, 1096 insertions(+), 259 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-12 16:09               ` Ananyev, Konstantin
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 02/11] eal: avoid invalid API usage in " Anatoly Burakov
                               ` (10 subsequent siblings)
  11 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, power intrinsics are inline functions. Make them part of the
ABI so that we can have various internal data associated with them
without exposing said data to the outside world.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v14:
    - Fix compile issues on ARM and PPC64 by moving implementations to .c files

 .../arm/include/rte_power_intrinsics.h        |  40 ------
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  42 ++++++
 .../include/generic/rte_power_intrinsics.h    |   6 +-
 .../ppc/include/rte_power_intrinsics.h        |  40 ------
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  42 ++++++
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 -----------------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 120 ++++++++++++++++++
 11 files changed, 215 insertions(+), 198 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index a4a1bc1159..9e498e9ebf 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -13,46 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	RTE_SET_USED(tsc_timestamp);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build
index d62875ebae..6ec53ea03a 100644
--- a/lib/librte_eal/arm/meson.build
+++ b/lib/librte_eal/arm/meson.build
@@ -7,4 +7,5 @@ sources += files(
 	'rte_cpuflags.c',
 	'rte_cycles.c',
 	'rte_hypervisor.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
new file mode 100644
index 0000000000..e5a49facb4
--- /dev/null
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+void rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index dd520d90fa..67977bd511 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -52,7 +52,7 @@
  *   to undefined result.
  */
 __rte_experimental
-static inline void rte_power_monitor(const volatile void *p,
+void rte_power_monitor(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz);
 
@@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p,
  *   wakes up.
  */
 __rte_experimental
-static inline void rte_power_monitor_sync(const volatile void *p,
+void rte_power_monitor_sync(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz,
 		rte_spinlock_t *lck);
@@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  *   architecture-dependent.
  */
 __rte_experimental
-static inline void rte_power_pause(const uint64_t tsc_timestamp);
+void rte_power_pause(const uint64_t tsc_timestamp);
 
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 4ed03d521f..c0e9ac279f 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -13,46 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	RTE_SET_USED(tsc_timestamp);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/ppc/meson.build b/lib/librte_eal/ppc/meson.build
index f4b6d95c42..43c46542fb 100644
--- a/lib/librte_eal/ppc/meson.build
+++ b/lib/librte_eal/ppc/meson.build
@@ -7,4 +7,5 @@ sources += files(
 	'rte_cpuflags.c',
 	'rte_cycles.c',
 	'rte_hypervisor.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
new file mode 100644
index 0000000000..785effabe6
--- /dev/null
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+void rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		       const uint64_t value_mask, const uint64_t tsc_timestamp,
+		       const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+			    const uint64_t value_mask, const uint64_t tsc_timestamp,
+			    const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 354c068f31..31bf76ae81 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -403,6 +403,11 @@ EXPERIMENTAL {
 	rte_service_lcore_may_be_active;
 	rte_vect_get_max_simd_bitwidth;
 	rte_vect_set_max_simd_bitwidth;
+
+	# added in 21.02
+	rte_power_monitor;
+	rte_power_monitor_sync;
+	rte_power_pause;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
index c7d790c854..e4c2b87f73 100644
--- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -13,121 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-static inline uint64_t
-__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz)
-{
-	switch (sz) {
-	case sizeof(uint8_t):
-		return *(const volatile uint8_t *)p;
-	case sizeof(uint16_t):
-		return *(const volatile uint16_t *)p;
-	case sizeof(uint32_t):
-		return *(const volatile uint32_t *)p;
-	case sizeof(uint64_t):
-		return *(const volatile uint64_t *)p;
-	default:
-		/* this is an intrinsic, so we can't have any error handling */
-		RTE_ASSERT(0);
-		return 0;
-	}
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
-/**
- * This function uses TPAUSE instruction  and will enter C0.2 state. For more
- * information about usage of this instruction, please refer to Intel(R) 64 and
- * IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* execute TPAUSE */
-	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-		: /* ignore rflags */
-		: "D"(0), /* enter C0.2 */
-		  "a"(tsc_l), "d"(tsc_h));
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build
index e78f29002e..dfd42dee0c 100644
--- a/lib/librte_eal/x86/meson.build
+++ b/lib/librte_eal/x86/meson.build
@@ -8,4 +8,5 @@ sources += files(
 	'rte_cycles.c',
 	'rte_hypervisor.c',
 	'rte_spinlock.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
new file mode 100644
index 0000000000..34c5fd9c3e
--- /dev/null
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses TPAUSE instruction  and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			"a"(tsc_l), "d"(tsc_h));
+}
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 02/11] eal: avoid invalid API usage in power intrinsics
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 03/11] eal: change API of " Anatoly Burakov
                               ` (9 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel,
	david.hunt, chris.macnamara

Currently, the API documentation mandates that if the user wants to use
the power management intrinsics, they need to call the
`rte_cpu_get_intrinsics_support` API and check support for specific
intrinsics.

However, if the user does not do that, it is possible to get an illegal
instruction error, because we're using raw instruction opcodes which may
or may not be supported at runtime.

Now that we have everything in a C file, we can check for support at
startup and prevent the user from possibly encountering illegal
instruction errors.
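
As a rough illustration of the idea (not the code in this patch): probe
once in a constructor, cache the result, and have every intrinsic bail
out early when support is missing. All names below are hypothetical
stand-ins; the patch itself uses RTE_INIT and
rte_cpu_get_intrinsics_support(). The probe here simply reports "not
supported" so the sketch stays portable, and the constructor attribute
assumes GCC or Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the CPU capability structure. */
struct cpu_caps {
	bool power_monitor;
	bool power_pause;
};

/* Cached at startup; read by every guarded intrinsic. */
static bool wait_supported;

/* Hypothetical probe: a real one would query CPUID for WAITPKG. */
static void
probe_cpu_caps(struct cpu_caps *caps)
{
	caps->power_monitor = false;
	caps->power_pause = false;
}

/* Runs before main(), similar in spirit to an RTE_INIT constructor. */
__attribute__((constructor))
static void
power_intrinsics_init(void)
{
	struct cpu_caps caps;

	probe_cpu_caps(&caps);
	if (caps.power_monitor && caps.power_pause)
		wait_supported = true;
}

/*
 * Guarded pause: returns false without touching the CPU when the
 * instruction is unavailable, true when it would have been issued.
 */
static bool
power_pause_guarded(uint64_t tsc_deadline)
{
	(void)tsc_deadline;
	if (!wait_supported)
		return false;
	/* the raw TPAUSE opcode would be emitted here */
	return true;
}
```

Unlike the void functions in the patch, the sketch returns a bool so
that the early-exit path is observable by a caller.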

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v15:
    - Remove accidental whitespace changes
    
    v14:
    - Replace uint8_t with bool

 .../include/generic/rte_power_intrinsics.h    |  3 ---
 lib/librte_eal/x86/rte_power_intrinsics.c     | 25 +++++++++++++++++++
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 67977bd511..ffa72f7578 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -34,7 +34,6 @@
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 34c5fd9c3e..a164ad55fc 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,8 @@
 
 #include "rte_power_intrinsics.h"
 
+static bool wait_supported;
+
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
 {
@@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp)
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
 	/* execute TPAUSE */
 	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			"a"(tsc_l), "d"(tsc_h));
 }
+
+RTE_INIT(rte_power_intrinsics_init) {
+	struct rte_cpu_intrinsics i;
+
+	rte_cpu_get_intrinsics_support(&i);
+
+	if (i.power_monitor && i.power_pause)
+		wait_supported = 1;
+}
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 03/11] eal: change API of power intrinsics
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 02/11] eal: avoid invalid API usage in " Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 04/11] eal: remove sync version of power monitor Anatoly Burakov
                               ` (8 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Timothy McDaniel, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Bruce Richardson, Konstantin Ananyev, thomas,
	david.hunt, chris.macnamara

Instead of passing around pointers and integers, collect everything
into a struct. This makes API design around these intrinsics much
easier.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/event/dlb/dlb.c                       | 10 ++--
 drivers/event/dlb2/dlb2.c                     | 10 ++--
 lib/librte_eal/arm/rte_power_intrinsics.c     | 25 ++++------
 .../include/generic/rte_power_intrinsics.h    | 49 ++++++++-----------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 25 ++++------
 lib/librte_eal/x86/rte_power_intrinsics.c     | 32 ++++++------
 6 files changed, 70 insertions(+), 81 deletions(-)

diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c
index 0c95c4793d..d2f2026291 100644
--- a/drivers/event/dlb/dlb.c
+++ b/drivers/event/dlb/dlb.c
@@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		/* Interrupts not supported by PF PMD */
 		return 1;
 	} else if (dlb->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 86724863f2..c9a8a02278 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 	if (elapsed_ticks >= timeout) {
 		return 1;
 	} else if (dlb2->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb2_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index e5a49facb4..f2c3506b90 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -7,36 +7,31 @@
 /**
  * This function is not supported on ARM.
  */
-void rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+void
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on ARM.
  */
-void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+void
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on ARM.
  */
-void rte_power_pause(const uint64_t tsc_timestamp)
+void
+rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
 }
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index ffa72f7578..00c670cb50 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -18,6 +18,18 @@
  * which are architecture-dependent.
  */
 
+struct rte_power_monitor_cond {
+	volatile void *addr;  /**< Address to monitor for changes */
+	uint64_t val;         /**< Before attempting the monitoring, the address
+	                       *   may be read and compared against this value.
+	                       **/
+	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
+	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
+	                  *   expected value with the memory address. Can be 1,
+	                  *   2, 4, or 8. Supplying any other value will lead to
+	                  *   undefined result. */
+};
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
@@ -35,25 +47,15 @@
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  */
 __rte_experimental
-void rte_power_monitor(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz);
+void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp);
 
 /**
  * @warning
@@ -75,30 +77,19 @@ void rte_power_monitor(const volatile void *p,
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  * @param lck
  *   A spinlock that must be locked before entering the function, will be
  *   unlocked while the CPU is sleeping, and will be locked again once the CPU
  *   wakes up.
  */
 __rte_experimental
-void rte_power_monitor_sync(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz,
-		rte_spinlock_t *lck);
+void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
 
 /**
  * @warning
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 785effabe6..3897d2024d 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -7,36 +7,31 @@
 /**
  * This function is not supported on PPC64.
  */
-void rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		       const uint64_t value_mask, const uint64_t tsc_timestamp,
-		       const uint8_t data_sz)
+void
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on PPC64.
  */
-void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-			    const uint64_t value_mask, const uint64_t tsc_timestamp,
-			    const uint8_t data_sz, rte_spinlock_t *lck)
+void
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 }
 
 /**
  * This function is not supported on PPC64.
  */
-void rte_power_pause(const uint64_t tsc_timestamp)
+void
+rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
 }
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index a164ad55fc..9b0638148d 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -31,9 +31,8 @@ __get_umwait_val(const volatile void *p, const uint8_t sz)
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -50,14 +49,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return;
 	}
 	/* execute UMWAIT */
@@ -73,9 +73,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -92,14 +91,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return;
 	}
 	rte_spinlock_unlock(lck);
-- 
2.25.1

* [dpdk-dev] [PATCH v15 04/11] eal: remove sync version of power monitor
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (2 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 05/11] eal: add monitor wakeup function Anatoly Burakov
                               ` (7 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the "sync" version of the power monitor intrinsic is meant
for waking up a sleeping core. However, there are better ways to
achieve the same result, so remove the unneeded function.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/arm/rte_power_intrinsics.c     | 12 -----
 .../include/generic/rte_power_intrinsics.h    | 34 --------------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 12 -----
 lib/librte_eal/version.map                    |  1 -
 lib/librte_eal/x86/rte_power_intrinsics.c     | 46 -------------------
 5 files changed, 105 deletions(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index f2c3506b90..6b8219b919 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -15,18 +15,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	RTE_SET_USED(tsc_timestamp);
 }
 
-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-}
-
 /**
  * This function is not supported on ARM.
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 00c670cb50..a6f1955996 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -57,40 +57,6 @@ __rte_experimental
 void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
-/**
- * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
- *
- * Monitor specific address for changes. This will cause the CPU to enter an
- * architecture-defined optimized power state until either the specified
- * memory address is written to, a certain TSC timestamp is reached, or other
- * reasons cause the CPU to wake up.
- *
- * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
- * mask is non-zero, the current value pointed to by the `p` pointer will be
- * checked against the expected value, and if they match, the entering of
- * optimized power state may be aborted.
- *
- * This call will also lock a spinlock on entering sleep, and release it on
- * waking up the CPU.
- *
- * @warning It is responsibility of the user to check if this function is
- *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *
- * @param pmc
- *   The monitoring condition structure.
- * @param tsc_timestamp
- *   Maximum TSC timestamp to wait for. Note that the wait behavior is
- *   architecture-dependent.
- * @param lck
- *   A spinlock that must be locked before entering the function, will be
- *   unlocked while the CPU is sleeping, and will be locked again once the CPU
- *   wakes up.
- */
-__rte_experimental
-void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
-
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 3897d2024d..9a40c4d5d6 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -15,18 +15,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	RTE_SET_USED(tsc_timestamp);
 }
 
-/**
- * This function is not supported on PPC64.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-}
-
 /**
  * This function is not supported on PPC64.
  */
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 31bf76ae81..20945b1efa 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,7 +406,6 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
-	rte_power_monitor_sync;
 	rte_power_pause;
 };
 
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 9b0638148d..487a783a2c 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -67,52 +67,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 			  "a"(tsc_l), "d"(tsc_h));
 }
 
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-void
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* prevent user from running this instruction if it's not supported */
-	if (!wait_supported)
-		return;
-
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(pmc->addr));
-
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->data_sz);
-		const uint64_t masked = cur_value & pmc->mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
 /**
  * This function uses TPAUSE instruction  and will enter C0.2 state. For more
  * information about usage of this instruction, please refer to Intel(R) 64 and
-- 
2.25.1

* [dpdk-dev] [PATCH v15 05/11] eal: add monitor wakeup function
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (3 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 04/11] eal: remove sync version of power monitor Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 06/11] ethdev: add simple power management API Anatoly Burakov
                               ` (6 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Now that we have everything in a C file, we can store the information
about our sleep, and have a native mechanism to wake up the sleeping
core. However, this mechanism will only wake up a core that is sleeping
while monitoring; it will not wake a core sleeping in `rte_power_pause`.
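
For readers who want the wakeup idea in isolation, below is a
single-threaded sketch, assuming GCC/Clang `__atomic` builtins. The names
`wait_slot`, `touch_without_change` and `wake_slot` are hypothetical; the
patch itself keeps the address in a per-lcore array and guards it with a
spinlock, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one per-lcore wait status entry. */
struct wait_slot {
	volatile uint64_t *monitor_addr; /* NULL when the owner is not sleeping */
};

/* Issue a write to addr that leaves the stored value unchanged. */
static void
touch_without_change(volatile uint64_t *addr)
{
	uint64_t val = __atomic_load_n(addr, __ATOMIC_RELAXED);

	/*
	 * Swap the value with itself. If another writer got in first, the
	 * exchange fails and nothing is stored, which is fine because that
	 * other write already touched the monitored location.
	 */
	__atomic_compare_exchange_n(addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/* Return 1 if a wakeup write was issued, 0 if the slot was not sleeping. */
static int
wake_slot(struct wait_slot *s)
{
	assert(s != NULL);
	if (s->monitor_addr == NULL)
		return 0;
	touch_without_change(s->monitor_addr);
	return 1;
}
```

The point of the compare-exchange is that the monitored descriptor keeps
its contents, so the woken core sees the same data it would have seen
without the wakeup.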

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v13:
    - Add comments around wakeup code to explain what it does
    - Add lcore_id parameter checking to prevent buffer overrun

 lib/librte_eal/arm/rte_power_intrinsics.c     |  9 ++
 .../include/generic/rte_power_intrinsics.h    | 16 ++++
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  9 ++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 85 +++++++++++++++++++
 5 files changed, 120 insertions(+)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 6b8219b919..14081a2c5b 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -23,3 +23,12 @@ rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
 }
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index a6f1955996..e311d6f8ea 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -57,6 +57,22 @@ __rte_experimental
 void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Wake up a specific lcore that is in a power optimized state and is monitoring
+ * an address.
+ *
+ * @note This function will *not* wake up a core that is in a power optimized
+ *   state due to calling `rte_power_pause`.
+ *
+ * @param lcore_id
+ *   Lcore ID of a sleeping thread.
+ */
+__rte_experimental
+void rte_power_monitor_wakeup(const unsigned int lcore_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 9a40c4d5d6..a7db61a7c3 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -23,3 +23,12 @@ rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
 }
+
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 20945b1efa..ac026e289d 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,6 +406,7 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
+	rte_power_monitor_wakeup;
 	rte_power_pause;
 };
 
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 487a783a2c..941da138ce 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -2,8 +2,31 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
+#include <rte_common.h>
+#include <rte_lcore.h>
+#include <rte_spinlock.h>
+
 #include "rte_power_intrinsics.h"
 
+/*
+ * Per-lcore structure holding current status of C0.2 sleeps.
+ */
+static struct power_wait_status {
+	rte_spinlock_t lock;
+	volatile void *monitor_addr; /**< NULL if not currently sleeping */
+} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+
+static inline void
+__umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	/* trigger a write but don't change the value */
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
 static bool wait_supported;
 
 static inline uint64_t
@@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s;
+
+	/* prevent non-EAL thread from using this API */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
@@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		if (masked == pmc->val)
 			return;
 	}
+
+	s = &wait_status[lcore_id];
+
+	/* update sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = pmc->addr;
+	rte_spinlock_unlock(&s->lock);
+
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
+
+	/* erase sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = NULL;
+	rte_spinlock_unlock(&s->lock);
 }
 
 /**
@@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) {
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
 }
+
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	struct power_wait_status *s;
+
+	/* prevent buffer overrun */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return;
+
+	s = &wait_status[lcore_id];
+
+	/*
+	 * There is a race condition between sleep, wakeup and locking, but we
+	 * don't need to handle it.
+	 *
+	 * Possible situations:
+	 *
+	 * 1. T1 locks, sets address, unlocks
+	 * 2. T2 locks, triggers wakeup, unlocks
+	 * 3. T1 sleeps
+	 *
+	 * In this case, because T1 has already set the address for monitoring,
+	 * we will wake up immediately even if T2 triggers wakeup before T1
+	 * goes to sleep.
+	 *
+	 * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
+	 * 2. T2 locks, triggers wakeup, and unlocks
+	 * 3. T1 locks, erases address, and unlocks
+	 *
+	 * In this case, since we've already woken up, the "wakeup" was
+	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
+	 * wakeup address is still valid so it's perfectly safe to write it.
+	 */
+	rte_spinlock_lock(&s->lock);
+	if (s->monitor_addr != NULL)
+		__umwait_wakeup(s->monitor_addr);
+	rte_spinlock_unlock(&s->lock);
+}
-- 
2.25.1

* [dpdk-dev] [PATCH v15 06/11] ethdev: add simple power management API
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (4 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 05/11] eal: add monitor wakeup function Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 07/11] power: add PMD power management API and callback Anatoly Burakov
                               ` (5 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, konstantin.ananyev, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API that retrieves, from the PMD, the monitor condition
for power-optimized monitoring of an Rx queue, and add the matching
release notes entry.
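
To make the error contract easier to follow, here is a standalone mock of
the validation order, assuming nothing beyond libc. `mock_dev`,
`mock_get_monitor_addr` and `example_driver_cb` are hypothetical names
used only for illustration; the real function additionally passes the
driver result through `eth_err()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct rte_power_monitor_cond. */
struct monitor_cond {
	volatile void *addr;
	uint64_t val;
	uint64_t mask;
	uint8_t data_sz;
};

typedef int (*get_monitor_addr_t)(void *rxq, struct monitor_cond *pmc);

/* Hypothetical, heavily reduced stand-in for an ethdev and its ops. */
struct mock_dev {
	get_monitor_addr_t get_monitor_addr; /* NULL if the driver lacks support */
	void **rx_queues;
	uint16_t nb_rx_queues;
};

/* Same check order as the patch: port, driver op, queue, then pmc. */
static int
mock_get_monitor_addr(struct mock_dev *devs, uint16_t nb_devs,
		uint16_t port_id, uint16_t queue_id, struct monitor_cond *pmc)
{
	struct mock_dev *dev;

	assert(devs != NULL);
	if (port_id >= nb_devs)
		return -ENODEV;
	dev = &devs[port_id];
	if (dev->get_monitor_addr == NULL)
		return -ENOTSUP;
	if (queue_id >= dev->nb_rx_queues)
		return -EINVAL;
	if (pmc == NULL)
		return -EINVAL;
	return dev->get_monitor_addr(dev->rx_queues[queue_id], pmc);
}

/* Example driver callback: monitor bit 0 of the first word of the queue. */
static int
example_driver_cb(void *rxq, struct monitor_cond *pmc)
{
	pmc->addr = rxq;
	pmc->val = 1;
	pmc->mask = 1;
	pmc->data_sz = sizeof(uint64_t);
	return 0;
}
```

A driver that does not set the callback is reported as -ENOTSUP before
the queue index is even looked at, so callers can probe for support with
any queue ID.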

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---

Notes:
    v13:
    - Fix typos and issues raised by Andrew

 doc/guides/rel_notes/release_21_02.rst |  5 +++++
 lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
 lib/librte_ethdev/version.map          |  3 +++
 5 files changed, 83 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 638f98168b..6de0cb568e 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: added new API for PMD power management**
+
+  * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
+    ``rte_power_monitor()`` to enable automatic power management for PMDs.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 17ddacc78d..e19dbd838b 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (pmc == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n",
+				pmc);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id],
+			pmc));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index f5f8919186..ca0f91312e 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -4334,6 +4335,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Retrieve the monitor condition for a given receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param pmc
+ *   Pointer to the power-optimized monitoring condition structure.
+ *
+ * @return
+ *   - 0: Success.
+ *   - -ENOTSUP: Operation not supported.
+ *   - -EINVAL: Invalid parameters.
+ *   - -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index 0eacfd8425..3b3b0ec1a0 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an Rx queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param pmc
+ *   The pointer to power-optimized monitoring condition structure.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_monitor_addr_t)(void *rxq,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -917,6 +937,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_monitor_addr_t get_monitor_addr;
+	/**< Get power monitoring condition for Rx queue. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index d3f5410806..a124e1e370 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -240,6 +240,9 @@ EXPERIMENTAL {
 	rte_flow_get_restore_info;
 	rte_flow_tunnel_action_decap_release;
 	rte_flow_tunnel_item_release;
+
+	# added in 21.02
+	rte_eth_get_monitor_addr;
 };
 
 INTERNAL {
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 07/11] power: add PMD power management API and callback
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (5 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 08/11] net/ixgbe: implement power management API Anatoly Burakov
                               ` (4 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas,
	konstantin.ananyev, timothy.mcdaniel, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design uses PMD RX callbacks and offers three power-saving schemes:

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling
   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.
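
The empty-poll counting that all three schemes share can be sketched in
plain C as below. This is only an illustration under stated assumptions,
not the code from this patch: `struct poll_state` and `on_rx_burst()` are
hypothetical names, the threshold reuses the EMPTYPOLL_MAX value from the
patch, and the real callbacks call into DPDK to sleep or scale frequency
instead of returning a flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold value as in the patch */

/* Hypothetical per-queue state; the patch keeps this in struct pmd_queue_cfg. */
struct poll_state {
	uint64_t empty_polls;
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns true when the caller should take a power-saving action
 * (monitor, pause or scale down), false otherwise.
 */
static bool
on_rx_burst(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: reset the counter, no power saving */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	/* only act once the threshold has been exceeded */
	return st->empty_polls > EMPTYPOLL_MAX;
}
```

Note that the counter is not reset after a power-saving action, so once
the threshold is exceeded every further empty poll triggers the action
again until a non-empty burst arrives.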

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v15:
    - Fix check in UMWAIT callback
    
    v13:
    - Rework the synchronization mechanism to not require locking
    - Add more parameter checking
    - Rework n_rx_queues access to not go through internal PMD structures and use
      public API instead
    

 doc/guides/prog_guide/power_man.rst    |  44 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 359 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 511 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..02280dd689 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides
+a convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever the empty poll count reaches a certain number.
+
+  * Monitor
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not available, use ``rte_pause()``.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing ``librte_power`` library
+   functionality to scale the core frequency up/down depending on traffic
+   volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core
+
 References
 ----------
 
@@ -200,3 +241,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 6de0cb568e..b34828cad6 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -60,6 +60,16 @@ New Features
   * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
     ``rte_power_monitor()`` to enable automatic power management for PMD's.
 
+* **Add PMD power management helper API**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..51a471b669 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..470c3a912b
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,359 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+static struct pmd_conf_data {
+	struct rte_cpu_intrinsics intrinsics_support;
+	/**< what do we support? */
+	uint64_t tsc_per_us;
+	/**< pre-calculated tsc diff for 1us */
+	uint64_t pause_per_us;
+	/**< how many rte_pause can we fit in a microsecond? */
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+	/** Device power management status is about to change. */
+	PMD_MGMT_BUSY
+};
+
+struct pmd_queue_cfg {
+	volatile enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	volatile bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us */
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			int ret;
+
+			/*
+			 * we might get a cancellation request while being
+			 * inside the callback, in which case the wakeup
+			 * wouldn't work because it would've arrived too early.
+			 *
+			 * to get around this, we notify the other thread that
+			 * we're sleeping, so that it can spin until we're done.
+			 * unsolicited wakeups are perfectly safe.
+			 */
+			q_conf->umwait_in_progress = true;
+
+			/* check if we need to cancel sleep */
+			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+				/* use monitoring condition to sleep */
+				ret = rte_eth_get_monitor_addr(port_id, qidx,
+						&pmc);
+				if (ret == 0)
+					rte_power_monitor(&pmc, -1ULL);
+			}
+			q_conf->umwait_in_progress = false;
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
+		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	struct rte_eth_dev_info info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_eth_dev_info_get(port_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if queue id is valid */
+	if (queue_id >= info.nb_rx_queues) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we're about to change our state */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		struct rte_power_monitor_cond dummy;
+
+		/* check if rte_power_monitor is supported */
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->umwait_in_progress = false;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto rollback;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+	return ret;
+
+rollback:
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* let the callback know we're shutting down */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..0bfbc6ba69
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management scheme to use for the specified Rx queue.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..61996b4d11 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 08/11] net/ixgbe: implement power management API
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (6 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 07/11] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 09/11] net/i40e: " Anatoly Burakov
                               ` (3 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.
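
How the `val`, `mask` and `data_sz` fields are meant to work together can
be sketched in self-contained C. This is a hedged illustration, not DPDK
code: `struct monitor_cond` is a hypothetical stand-in for
`struct rte_power_monitor_cond`, `already_written()` is a made-up helper,
and the sketch assumes that the monitoring intrinsic skips the sleep when
the masked value at the watched address already equals `val`. Values are
in host byte order here, whereas the driver converts them to little
endian.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_power_monitor_cond. */
struct monitor_cond {
	const volatile void *addr; /* address to watch */
	uint64_t val;              /* expected value after masking */
	uint64_t mask;             /* bits of interest */
	uint8_t data_sz;           /* width of the watched field in bytes */
};

/* Read data_sz bytes from addr, zero-extended to 64 bits. */
static uint64_t
read_value(const struct monitor_cond *c)
{
	switch (c->data_sz) {
	case 1: return *(const volatile uint8_t *)c->addr;
	case 2: return *(const volatile uint16_t *)c->addr;
	case 4: return *(const volatile uint32_t *)c->addr;
	case 8: return *(const volatile uint64_t *)c->addr;
	default: return 0; /* unsupported width: treat as "not written" */
	}
}

/* True if the descriptor was already written, so sleeping is pointless. */
static bool
already_written(const struct monitor_cond *c)
{
	return (read_value(c) & c->mask) == c->val;
}
```

With `val` and `mask` both set to the DD bit, as in this patch, the check
is true only once the NIC has written the descriptor back, regardless of
the other status bits.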

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 9a47a8b262..4b7a5ca60b 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_monitor_addr     = ixgbe_get_monitor_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 6cfbb582e2..7e046a1819 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int
+ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	pmc->data_sz = sizeof(uint32_t);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..8a25e98df6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 09/11] net/i40e: implement power management API
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (7 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 08/11] net/ixgbe: implement power management API Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 10/11] net/ice: " Anatoly Burakov
                               ` (2 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index f54769c29d..af2577a140 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_monitor_addr             = i40e_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..0b4220fc9c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,31 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	pmc->data_sz = sizeof(uint64_t);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..e1494525ce 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 10/11] net/ice: implement power management API
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (8 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 09/11] net/i40e: " Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  1 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 9a5d6a559f..c21682c120 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_monitor_addr             = ice_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 5fbd68eafc..fa9e9a235b 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int
+ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	pmc->data_sz = sizeof(uint16_t);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6b16716063..906fbefdc4 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts,
 				  uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v15 11/11] examples/l3fwd-power: enable PMD power mgmt
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (9 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 10/11] net/ice: " Anatoly Burakov
@ 2021-01-11 14:58             ` Anatoly Burakov
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev,
	timothy.mcdaniel, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.
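
The scheme is selected by name on the command line (`monitor`, `pause` or
`scale`). A minimal sketch of such a mapping is shown below; it is an
assumption for illustration and not the parsing code from this patch.
`enum pmd_mgmt_type` and `parse_pmd_mgmt_type()` are hypothetical names
that only mirror the ordering of `enum rte_power_pmd_mgmt_type`.

```c
#include <string.h>

/* Hypothetical enum mirroring enum rte_power_pmd_mgmt_type. */
enum pmd_mgmt_type {
	PMD_MGMT_TYPE_INVALID = 0,
	PMD_MGMT_TYPE_MONITOR = 1,
	PMD_MGMT_TYPE_PAUSE,
	PMD_MGMT_TYPE_SCALE,
};

/* Map a scheme name to its enum value; unknown names are rejected. */
static enum pmd_mgmt_type
parse_pmd_mgmt_type(const char *arg)
{
	if (arg == NULL)
		return PMD_MGMT_TYPE_INVALID;
	if (strcmp(arg, "monitor") == 0)
		return PMD_MGMT_TYPE_MONITOR;
	if (strcmp(arg, "pause") == 0)
		return PMD_MGMT_TYPE_PAUSE;
	if (strcmp(arg, "scale") == 0)
		return PMD_MGMT_TYPE_SCALE;
	return PMD_MGMT_TYPE_INVALID;
}
```

Reserving 0 for the invalid case works because the real enum starts at 1,
so a zero-initialized configuration never selects a scheme by accident.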

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v12:
    - Allow selecting PMD power management scheme from command-line
    - Enforce 1 core 1 queue rule

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 89 ++++++++++++++++++-
 2 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+PMD power management mode is a standalone mode of ``l3fwd-power``. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables the power saving scheme on specific
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt=monitor -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..e312b6f355 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2824,6 +2895,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov
@ 2021-01-12 15:54           ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-12 15:54 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, McDaniel,
	Timothy, Hunt, David, Macnamara, Chris



> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Friday, January 8, 2021 5:42 PM
> To: dev@dpdk.org
> Cc: Jan Viktorin <viktorin@rehivetech.com>; Ruifeng Wang <ruifeng.wang@arm.com>; Jerin Jacob <jerinj@marvell.com>; David
> Christensen <drc@linux.vnet.ibm.com>; Ray Kinsella <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; thomas@monjalon.net; gage.eads@intel.com;
> McDaniel, Timothy <timothy.mcdaniel@intel.com>; Hunt, David <david.hunt@intel.com>; Macnamara, Chris
> <chris.macnamara@intel.com>
> Subject: [PATCH v13 01/11] eal: uninline power intrinsics
> 
> Currently, power intrinsics are inline functions. Make them part of the
> ABI so that we can have various internal data associated with them
> without exposing said data to the outside world.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> --
> 2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in power intrinsics
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov
  2021-01-08 19:58           ` Stephen Hemminger
@ 2021-01-12 15:56           ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-12 15:56 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, thomas, gage.eads, McDaniel, Timothy, Hunt,
	David, Macnamara, Chris

> Currently, the API documentation mandates that if the user wants to use
> the power management intrinsics, they need to call the
> `rte_cpu_get_intrinsics_support` API and check support for specific
> intrinsics.
> 
> However, if the user does not do that, it is possible to get illegal
> instruction error because we're using raw instruction opcodes, which may
> or may not be supported at runtime.
> 
> Now that we have everything in a C file, we can check for support at
> startup and prevent the user from possibly encountering illegal
> instruction errors.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../include/generic/rte_power_intrinsics.h    |  3 --
>  lib/librte_eal/x86/rte_power_intrinsics.c     | 31 +++++++++++++++++--
>  2 files changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> index 67977bd511..ffa72f7578 100644
> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -34,7 +34,6 @@
>   *
>   * @warning It is responsibility of the user to check if this function is
>   *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
> - *   Failing to do so may result in an illegal CPU instruction error.
>   *
>   * @param p
>   *   Address to monitor for changes.
> @@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p,
>   *
>   * @warning It is responsibility of the user to check if this function is
>   *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
> - *   Failing to do so may result in an illegal CPU instruction error.
>   *
>   * @param p
>   *   Address to monitor for changes.
> @@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p,
>   *
>   * @warning It is responsibility of the user to check if this function is
>   *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
> - *   Failing to do so may result in an illegal CPU instruction error.
>   *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
> index 34c5fd9c3e..b48a54ec7f 100644
> --- a/lib/librte_eal/x86/rte_power_intrinsics.c
> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c
> @@ -4,6 +4,8 @@
> 
>  #include "rte_power_intrinsics.h"
> 
> +static uint8_t wait_supported;
> +
>  static inline uint64_t
>  __get_umwait_val(const volatile void *p, const uint8_t sz)
>  {
> @@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
>  {
>  	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
>  	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +
> +	/* prevent user from running this instruction if it's not supported */
> +	if (!wait_supported)
> +		return;
> +
>  	/*
>  	 * we're using raw byte codes for now as only the newest compiler
>  	 * versions support this instruction natively.
> @@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
>  {
>  	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
>  	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +
> +	/* prevent user from running this instruction if it's not supported */
> +	if (!wait_supported)
> +		return;
> +
>  	/*
>  	 * we're using raw byte codes for now as only the newest compiler
>  	 * versions support this instruction natively.
> @@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp)
>  	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
>  	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> 
> +	/* prevent user from running this instruction if it's not supported */
> +	if (!wait_supported)
> +		return;
> +
>  	/* execute TPAUSE */
>  	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
> -			: /* ignore rflags */
> -			: "D"(0), /* enter C0.2 */
> -			"a"(tsc_l), "d"(tsc_h));
> +		: /* ignore rflags */
> +		: "D"(0), /* enter C0.2 */
> +		  "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +RTE_INIT(rte_power_intrinsics_init) {
> +	struct rte_cpu_intrinsics i;
> +
> +	rte_cpu_get_intrinsics_support(&i);
> +
> +	if (i.power_monitor && i.power_pause)
> +		wait_supported = 1;
>  }
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v13 03/11] eal: change API of power intrinsics
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-12 15:58           ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-12 15:58 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: McDaniel, Timothy, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Richardson, Bruce, thomas, gage.eads, Hunt,
	David, Macnamara, Chris


> 
> Instead of passing around pointers and integers, collect everything
> into struct. This makes API design around these intrinsics much easier.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> --
> 2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor Anatoly Burakov
@ 2021-01-12 15:59           ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-12 15:59 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, gage.eads,
	McDaniel, Timothy, Hunt, David, Macnamara, Chris



> 
> Currently, the "sync" version of power monitor intrinsic is supposed to
> be used for purposes of waking up a sleeping core. However, there are
> better ways to achieve the same result, so remove the unneeded function.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> --
> 2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function
  2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function Anatoly Burakov
@ 2021-01-12 16:02           ` Ananyev, Konstantin
  2021-01-12 16:18             ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-12 16:02 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, gage.eads,
	McDaniel, Timothy, Hunt, David, Macnamara, Chris


> 
> Now that we have everything in a C file, we can store the information
> about our sleep, and have a native mechanism to wake up the sleeping
> core. This mechanism would however only wake up a core that's sleeping
> while monitoring - waking up from `rte_power_pause` won't work.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v13:
>     - Add comments around wakeup code to explain what it does
>     - Add lcore_id parameter checking to prevent buffer overrun
> 
>  .../arm/include/rte_power_intrinsics.h        |  9 ++
>  .../include/generic/rte_power_intrinsics.h    | 16 ++++
>  .../ppc/include/rte_power_intrinsics.h        |  9 ++
>  lib/librte_eal/version.map                    |  1 +
>  lib/librte_eal/x86/rte_power_intrinsics.c     | 85 +++++++++++++++++++
>  5 files changed, 120 insertions(+)
> 
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> index 27869251a8..39e49cc45b 100644
> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
>  	RTE_SET_USED(tsc_timestamp);
>  }
> 
> +/**
> + * This function is not supported on ARM.
> + */
> +void
> +rte_power_monitor_wakeup(const unsigned int lcore_id)
> +{
> +	RTE_SET_USED(lcore_id);
> +}
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> index a6f1955996..e311d6f8ea 100644
> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -57,6 +57,22 @@ __rte_experimental
>  void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>  		const uint64_t tsc_timestamp);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Wake up a specific lcore that is in a power optimized state and is monitoring
> + * an address.
> + *
> + * @note This function will *not* wake up a core that is in a power optimized
> + *   state due to calling `rte_power_pause`.
> + *
> + * @param lcore_id
> + *   Lcore ID of a sleeping thread.
> + */
> +__rte_experimental
> +void rte_power_monitor_wakeup(const unsigned int lcore_id);
> +
>  /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice
> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> index 248d1f4a23..2e7db0e7eb 100644
> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
>  	RTE_SET_USED(tsc_timestamp);
>  }
> 
> +/**
> + * This function is not supported on PPC64.
> + */
> +void
> +rte_power_monitor_wakeup(const unsigned int lcore_id)
> +{
> +	RTE_SET_USED(lcore_id);
> +}
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
> index 20945b1efa..ac026e289d 100644
> --- a/lib/librte_eal/version.map
> +++ b/lib/librte_eal/version.map
> @@ -406,6 +406,7 @@ EXPERIMENTAL {
> 
>  	# added in 21.02
>  	rte_power_monitor;
> +	rte_power_monitor_wakeup;
>  	rte_power_pause;
>  };
> 
> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
> index a9cd1afe9d..46a4fb6cd5 100644
> --- a/lib/librte_eal/x86/rte_power_intrinsics.c
> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c
> @@ -2,8 +2,31 @@
>   * Copyright(c) 2020 Intel Corporation
>   */
> 
> +#include <rte_common.h>
> +#include <rte_lcore.h>
> +#include <rte_spinlock.h>
> +
>  #include "rte_power_intrinsics.h"
> 
> +/*
> + * Per-lcore structure holding current status of C0.2 sleeps.
> + */
> +static struct power_wait_status {
> +	rte_spinlock_t lock;
> +	volatile void *monitor_addr; /**< NULL if not currently sleeping */
> +} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
> +
> +static inline void
> +__umwait_wakeup(volatile void *addr)
> +{
> +	uint64_t val;
> +
> +	/* trigger a write but don't change the value */
> +	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
> +	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
> +			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
> +}
> +
>  static uint8_t wait_supported;
> 
>  static inline uint64_t
> @@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>  {
>  	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
>  	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	const unsigned int lcore_id = rte_lcore_id();
> +	struct power_wait_status *s;
> +
> +	/* prevent non-EAL thread from using this API */
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return;
> 
>  	/* prevent user from running this instruction if it's not supported */
>  	if (!wait_supported)
> @@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>  		if (masked == pmc->val)
>  			return;
>  	}
> +
> +	s = &wait_status[lcore_id];
> +
> +	/* update sleep address */
> +	rte_spinlock_lock(&s->lock);
> +	s->monitor_addr = pmc->addr;
> +	rte_spinlock_unlock(&s->lock);

It has been a while since I last looked at this,
but shouldn't we grab the lock before monitor()?
I.e.:
lock();
monitor();
addr=...;
unlock();
umwait();
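
The ordering in the pseudocode above can be sketched as a small self-contained C program. This is an illustration only, not the patch's code: UMONITOR and UMWAIT are replaced by stubs that record events, the spinlock is a plain C11 atomic_flag, and every name here (`wait_state`, `sleep_on`, `arm_monitor`, `wait_for_write`, `trace`) is hypothetical. It shows only the sequence of steps and makes no claim about which ordering is preferable.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-lcore state, loosely modeled on power_wait_status. */
struct wait_state {
	atomic_flag lock;            /* minimal spinlock */
	volatile void *monitor_addr; /* NULL if not currently sleeping */
};

/* Event trace so the ordering can be inspected afterwards. */
static char trace[8];
static size_t trace_len;

static void
record(char event)
{
	if (trace_len < sizeof(trace))
		trace[trace_len++] = event;
}

static void
state_lock(struct wait_state *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		; /* spin until the flag is released */
}

static void
state_unlock(struct wait_state *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Stand-in for UMONITOR: only records that arming happened. */
static void
arm_monitor(volatile void *addr)
{
	(void)addr;
	record('M');
}

/* Stand-in for UMWAIT: 'W' if the address was already published, 'w' if not. */
static void
wait_for_write(const struct wait_state *s)
{
	record(s->monitor_addr != NULL ? 'W' : 'w');
}

static void
sleep_on(struct wait_state *s, volatile void *addr)
{
	/* arm the monitor and publish the address in one critical section */
	state_lock(s);
	arm_monitor(addr);
	s->monitor_addr = addr;
	state_unlock(s);

	/* wait with the lock released, so that a waker can take it */
	wait_for_write(s);

	/* erase the sleep address once we are awake again */
	state_lock(s);
	s->monitor_addr = NULL;
	state_unlock(s);
}
```

Initializing the state as `{ ATOMIC_FLAG_INIT, NULL }` and calling sleep_on() once leaves an 'M' followed by a 'W' in the trace, i.e. the monitor is armed and the address published before the wait begins.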

> +
>  	/* execute UMWAIT */
>  	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
>  			: /* ignore rflags */
>  			: "D"(0), /* enter C0.2 */
>  			  "a"(tsc_l), "d"(tsc_h));
> +
> +	/* erase sleep address */
> +	rte_spinlock_lock(&s->lock);
> +	s->monitor_addr = NULL;
> +	rte_spinlock_unlock(&s->lock);
>  }
> 
>  /**
> @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) {
>  	if (i.power_monitor && i.power_pause)
>  		wait_supported = 1;
>  }
> +
> +void
> +rte_power_monitor_wakeup(const unsigned int lcore_id)
> +{
> +	struct power_wait_status *s;
> +
> +	/* prevent buffer overrun */
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return;
> +
> +	/* prevent user from running this instruction if it's not supported */
> +	if (!wait_supported)
> +		return;
> +
> +	s = &wait_status[lcore_id];
> +
> +	/*
> +	 * There is a race condition between sleep, wakeup and locking, but we
> +	 * don't need to handle it.
> +	 *
> +	 * Possible situations:
> +	 *
> +	 * 1. T1 locks, sets address, unlocks
> +	 * 2. T2 locks, triggers wakeup, unlocks
> +	 * 3. T1 sleeps
> +	 *
> +	 * In this case, because T1 has already set the address for monitoring,
> +	 * we will wake up immediately even if T2 triggers wakeup before T1
> +	 * goes to sleep.
> +	 *
> +	 * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
> +	 * 2. T2 locks, triggers wakeup, and unlocks
> +	 * 3. T1 locks, erases address, and unlocks
> +	 *
> +	 * In this case, since we've already woken up, the "wakeup" was
> +	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
> +	 * wakeup address is still valid so it's perfectly safe to write it.
> +	 */
> +	rte_spinlock_lock(&s->lock);
> +	if (s->monitor_addr != NULL)
> +		__umwait_wakeup(s->monitor_addr);
> +	rte_spinlock_unlock(&s->lock);
> +}
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov
@ 2021-01-12 16:09               ` Ananyev, Konstantin
  2021-01-12 16:14                 ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-12 16:09 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, McDaniel,
	Timothy, Hunt, David, Macnamara, Chris


> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
> new file mode 100644
> index 0000000000..34c5fd9c3e
> --- /dev/null
> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c
> @@ -0,0 +1,120 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#include "rte_power_intrinsics.h"
> +
> +static inline uint64_t
> +__get_umwait_val(const volatile void *p, const uint8_t sz)
> +{
> +	switch (sz) {
> +	case sizeof(uint8_t):
> +		return *(const volatile uint8_t *)p;
> +	case sizeof(uint16_t):
> +		return *(const volatile uint16_t *)p;
> +	case sizeof(uint32_t):
> +		return *(const volatile uint32_t *)p;
> +	case sizeof(uint64_t):
> +		return *(const volatile uint64_t *)p;
> +	default:
> +		/* this is an intrinsic, so we can't have any error handling */
> +		RTE_ASSERT(0);
> +		return 0;

Nearly forgot: now that this function is no longer inline, we can probably
get rid of the assert and return an error code instead?

> +	}
> +}
> +

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics
  2021-01-12 16:09               ` Ananyev, Konstantin
@ 2021-01-12 16:14                 ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-12 16:14 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, McDaniel,
	Timothy, Hunt, David, Macnamara, Chris

On 12-Jan-21 4:09 PM, Ananyev, Konstantin wrote:
> 
>> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
>> new file mode 100644
>> index 0000000000..34c5fd9c3e
>> --- /dev/null
>> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c
>> @@ -0,0 +1,120 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2020 Intel Corporation
>> + */
>> +
>> +#include "rte_power_intrinsics.h"
>> +
>> +static inline uint64_t
>> +__get_umwait_val(const volatile void *p, const uint8_t sz)
>> +{
>> +switch (sz) {
>> +case sizeof(uint8_t):
>> +return *(const volatile uint8_t *)p;
>> +case sizeof(uint16_t):
>> +return *(const volatile uint16_t *)p;
>> +case sizeof(uint32_t):
>> +return *(const volatile uint32_t *)p;
>> +case sizeof(uint64_t):
>> +return *(const volatile uint64_t *)p;
>> +default:
>> +/* this is an intrinsic, so we can't have any error handling */
>> +RTE_ASSERT(0);
>> +return 0;
> 
> Nearly forgot: now that this function is no longer inline, we can probably
> get rid of the assert and return an error code instead?
> 

Well, this would necessitate a change of API to include return values,
which I think is OK, because it's a fully fledged API (rather than an
intrinsic) at this point anyway.

>> +}
>> +}
>> +


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function
  2021-01-12 16:02           ` Ananyev, Konstantin
@ 2021-01-12 16:18             ` Burakov, Anatoly
  2021-01-12 16:25               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-12 16:18 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, gage.eads,
	McDaniel, Timothy, Hunt, David, Macnamara, Chris

On 12-Jan-21 4:02 PM, Ananyev, Konstantin wrote:
> 
>>
>> Now that we have everything in a C file, we can store the information
>> about our sleep, and have a native mechanism to wake up the sleeping
>> core. This mechanism would however only wake up a core that's sleeping
>> while monitoring - waking up from `rte_power_pause` won't work.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v13:
>>      - Add comments around wakeup code to explain what it does
>>      - Add lcore_id parameter checking to prevent buffer overrun
>>
>>   .../arm/include/rte_power_intrinsics.h        |  9 ++
>>   .../include/generic/rte_power_intrinsics.h    | 16 ++++
>>   .../ppc/include/rte_power_intrinsics.h        |  9 ++
>>   lib/librte_eal/version.map                    |  1 +
>>   lib/librte_eal/x86/rte_power_intrinsics.c     | 85 +++++++++++++++++++
>>   5 files changed, 120 insertions(+)
>>
>> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> index 27869251a8..39e49cc45b 100644
>> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
>>   RTE_SET_USED(tsc_timestamp);
>>   }
>>
>> +/**
>> + * This function is not supported on ARM.
>> + */
>> +void
>> +rte_power_monitor_wakeup(const unsigned int lcore_id)
>> +{
>> +RTE_SET_USED(lcore_id);
>> +}
>> +
>>   #ifdef __cplusplus
>>   }
>>   #endif
>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>> index a6f1955996..e311d6f8ea 100644
>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>> @@ -57,6 +57,22 @@ __rte_experimental
>>   void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>   const uint64_t tsc_timestamp);
>>
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Wake up a specific lcore that is in a power optimized state and is monitoring
>> + * an address.
>> + *
>> + * @note This function will *not* wake up a core that is in a power optimized
>> + *   state due to calling `rte_power_pause`.
>> + *
>> + * @param lcore_id
>> + *   Lcore ID of a sleeping thread.
>> + */
>> +__rte_experimental
>> +void rte_power_monitor_wakeup(const unsigned int lcore_id);
>> +
>>   /**
>>    * @warning
>>    * @b EXPERIMENTAL: this API may change without prior notice
>> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
>> index 248d1f4a23..2e7db0e7eb 100644
>> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
>> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
>> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
>>   RTE_SET_USED(tsc_timestamp);
>>   }
>>
>> +/**
>> + * This function is not supported on PPC64.
>> + */
>> +void
>> +rte_power_monitor_wakeup(const unsigned int lcore_id)
>> +{
>> +RTE_SET_USED(lcore_id);
>> +}
>> +
>>   #ifdef __cplusplus
>>   }
>>   #endif
>> diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
>> index 20945b1efa..ac026e289d 100644
>> --- a/lib/librte_eal/version.map
>> +++ b/lib/librte_eal/version.map
>> @@ -406,6 +406,7 @@ EXPERIMENTAL {
>>
>>   # added in 21.02
>>   rte_power_monitor;
>> +rte_power_monitor_wakeup;
>>   rte_power_pause;
>>   };
>>
>> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
>> index a9cd1afe9d..46a4fb6cd5 100644
>> --- a/lib/librte_eal/x86/rte_power_intrinsics.c
>> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c
>> @@ -2,8 +2,31 @@
>>    * Copyright(c) 2020 Intel Corporation
>>    */
>>
>> +#include <rte_common.h>
>> +#include <rte_lcore.h>
>> +#include <rte_spinlock.h>
>> +
>>   #include "rte_power_intrinsics.h"
>>
>> +/*
>> + * Per-lcore structure holding current status of C0.2 sleeps.
>> + */
>> +static struct power_wait_status {
>> +rte_spinlock_t lock;
>> +volatile void *monitor_addr; /**< NULL if not currently sleeping */
>> +} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
>> +
>> +static inline void
>> +__umwait_wakeup(volatile void *addr)
>> +{
>> +uint64_t val;
>> +
>> +/* trigger a write but don't change the value */
>> +val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
>> +__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
>> +__ATOMIC_RELAXED, __ATOMIC_RELAXED);
>> +}
>> +
>>   static uint8_t wait_supported;
>>
>>   static inline uint64_t
>> @@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>   {
>>   const uint32_t tsc_l = (uint32_t)tsc_timestamp;
>>   const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
>> +const unsigned int lcore_id = rte_lcore_id();
>> +struct power_wait_status *s;
>> +
>> +/* prevent non-EAL thread from using this API */
>> +if (lcore_id >= RTE_MAX_LCORE)
>> +return;
>>
>>   /* prevent user from running this instruction if it's not supported */
>>   if (!wait_supported)
>> @@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>   if (masked == pmc->val)
>>   return;
>>   }
>> +
>> +s = &wait_status[lcore_id];
>> +
>> +/* update sleep address */
>> +rte_spinlock_lock(&s->lock);
>> +s->monitor_addr = pmc->addr;
>> +rte_spinlock_unlock(&s->lock);
> 
> It was a while, since I looked at it last time,
> but shouldn't we grab the lock before monitor()?
> I.E:
> lock();
> monitor();
> addr=...;
> unlock();
> umwait();
> 

I don't believe so.

The idea here is to only store the address when we are looking to sleep, 
and avoid the locks entirely if we already know we aren't going to 
sleep. I mean, technically we could lock unconditionally, then unlock 
when we're done, but there's very little practical difference between 
the two because the moment we are interested in (sleep) happens the same 
way whether we lock before or after monitor().

>> +
>>   /* execute UMWAIT */
>>   asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
>>   : /* ignore rflags */
>>   : "D"(0), /* enter C0.2 */
>>     "a"(tsc_l), "d"(tsc_h));
>> +
>> +/* erase sleep address */
>> +rte_spinlock_lock(&s->lock);
>> +s->monitor_addr = NULL;
>> +rte_spinlock_unlock(&s->lock);
>>   }
>>
>>   /**
>> @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) {
>>   if (i.power_monitor && i.power_pause)
>>   wait_supported = 1;
>>   }
>> +
>> +void
>> +rte_power_monitor_wakeup(const unsigned int lcore_id)
>> +{
>> +struct power_wait_status *s;
>> +
>> +/* prevent buffer overrun */
>> +if (lcore_id >= RTE_MAX_LCORE)
>> +return;
>> +
>> +/* prevent user from running this instruction if it's not supported */
>> +if (!wait_supported)
>> +return;
>> +
>> +s = &wait_status[lcore_id];
>> +
>> +/*
>> + * There is a race condition between sleep, wakeup and locking, but we
>> + * don't need to handle it.
>> + *
>> + * Possible situations:
>> + *
>> + * 1. T1 locks, sets address, unlocks
>> + * 2. T2 locks, triggers wakeup, unlocks
>> + * 3. T1 sleeps
>> + *
>> + * In this case, because T1 has already set the address for monitoring,
>> + * we will wake up immediately even if T2 triggers wakeup before T1
>> + * goes to sleep.
>> + *
>> + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
>> + * 2. T2 locks, triggers wakeup, and unlocks
>> + * 3. T1 locks, erases address, and unlocks
>> + *
>> + * In this case, since we've already woken up, the "wakeup" was
>> + * unneeded, and since T1 is still waiting on T2 releasing the lock, the
>> + * wakeup address is still valid so it's perfectly safe to write it.
>> + */
>> +rte_spinlock_lock(&s->lock);
>> +if (s->monitor_addr != NULL)
>> +__umwait_wakeup(s->monitor_addr);
>> +rte_spinlock_unlock(&s->lock);
>> +}
>> --
>> 2.25.1


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function
  2021-01-12 16:18             ` Burakov, Anatoly
@ 2021-01-12 16:25               ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-12 16:25 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, gage.eads,
	McDaniel, Timothy, Hunt, David, Macnamara, Chris

On 12-Jan-21 4:18 PM, Burakov, Anatoly wrote:
> On 12-Jan-21 4:02 PM, Ananyev, Konstantin wrote:
>>
>>>
>>> Now that we have everything in a C file, we can store the information
>>> about our sleep, and have a native mechanism to wake up the sleeping
>>> core. This mechanism would however only wake up a core that's sleeping
>>> while monitoring - waking up from `rte_power_pause` won't work.
>>>
>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>> ---
>>>
>>> Notes:
>>>      v13:
>>>      - Add comments around wakeup code to explain what it does
>>>      - Add lcore_id parameter checking to prevent buffer overrun
>>>
>>>   .../arm/include/rte_power_intrinsics.h        |  9 ++
>>>   .../include/generic/rte_power_intrinsics.h    | 16 ++++
>>>   .../ppc/include/rte_power_intrinsics.h        |  9 ++
>>>   lib/librte_eal/version.map                    |  1 +
>>>   lib/librte_eal/x86/rte_power_intrinsics.c     | 85 +++++++++++++++++++
>>>   5 files changed, 120 insertions(+)
>>>
>>> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h 
>>> b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>>> index 27869251a8..39e49cc45b 100644
>>> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
>>> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>>> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
>>>   RTE_SET_USED(tsc_timestamp);
>>>   }
>>>
>>> +/**
>>> + * This function is not supported on ARM.
>>> + */
>>> +void
>>> +rte_power_monitor_wakeup(const unsigned int lcore_id)
>>> +{
>>> +RTE_SET_USED(lcore_id);
>>> +}
>>> +
>>>   #ifdef __cplusplus
>>>   }
>>>   #endif
>>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h 
>>> b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>> index a6f1955996..e311d6f8ea 100644
>>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
>>> @@ -57,6 +57,22 @@ __rte_experimental
>>>   void rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>>   const uint64_t tsc_timestamp);
>>>
>>> +/**
>>> + * @warning
>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>> + *
>>> + * Wake up a specific lcore that is in a power optimized state and 
>>> is monitoring
>>> + * an address.
>>> + *
>>> + * @note This function will *not* wake up a core that is in a power 
>>> optimized
>>> + *   state due to calling `rte_power_pause`.
>>> + *
>>> + * @param lcore_id
>>> + *   Lcore ID of a sleeping thread.
>>> + */
>>> +__rte_experimental
>>> +void rte_power_monitor_wakeup(const unsigned int lcore_id);
>>> +
>>>   /**
>>>    * @warning
>>>    * @b EXPERIMENTAL: this API may change without prior notice
>>> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h 
>>> b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
>>> index 248d1f4a23..2e7db0e7eb 100644
>>> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
>>> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
>>> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp)
>>>   RTE_SET_USED(tsc_timestamp);
>>>   }
>>>
>>> +/**
>>> + * This function is not supported on PPC64.
>>> + */
>>> +void
>>> +rte_power_monitor_wakeup(const unsigned int lcore_id)
>>> +{
>>> +RTE_SET_USED(lcore_id);
>>> +}
>>> +
>>>   #ifdef __cplusplus
>>>   }
>>>   #endif
>>> diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
>>> index 20945b1efa..ac026e289d 100644
>>> --- a/lib/librte_eal/version.map
>>> +++ b/lib/librte_eal/version.map
>>> @@ -406,6 +406,7 @@ EXPERIMENTAL {
>>>
>>>   # added in 21.02
>>>   rte_power_monitor;
>>> +rte_power_monitor_wakeup;
>>>   rte_power_pause;
>>>   };
>>>
>>> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c 
>>> b/lib/librte_eal/x86/rte_power_intrinsics.c
>>> index a9cd1afe9d..46a4fb6cd5 100644
>>> --- a/lib/librte_eal/x86/rte_power_intrinsics.c
>>> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c
>>> @@ -2,8 +2,31 @@
>>>    * Copyright(c) 2020 Intel Corporation
>>>    */
>>>
>>> +#include <rte_common.h>
>>> +#include <rte_lcore.h>
>>> +#include <rte_spinlock.h>
>>> +
>>>   #include "rte_power_intrinsics.h"
>>>
>>> +/*
>>> + * Per-lcore structure holding current status of C0.2 sleeps.
>>> + */
>>> +static struct power_wait_status {
>>> +rte_spinlock_t lock;
>>> +volatile void *monitor_addr; /**< NULL if not currently sleeping */
>>> +} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
>>> +
>>> +static inline void
>>> +__umwait_wakeup(volatile void *addr)
>>> +{
>>> +uint64_t val;
>>> +
>>> +/* trigger a write but don't change the value */
>>> +val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
>>> +__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
>>> +__ATOMIC_RELAXED, __ATOMIC_RELAXED);
>>> +}
>>> +
>>>   static uint8_t wait_supported;
>>>
>>>   static inline uint64_t
>>> @@ -36,6 +59,12 @@ rte_power_monitor(const struct 
>>> rte_power_monitor_cond *pmc,
>>>   {
>>>   const uint32_t tsc_l = (uint32_t)tsc_timestamp;
>>>   const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
>>> +const unsigned int lcore_id = rte_lcore_id();
>>> +struct power_wait_status *s;
>>> +
>>> +/* prevent non-EAL thread from using this API */
>>> +if (lcore_id >= RTE_MAX_LCORE)
>>> +return;
>>>
>>>   /* prevent user from running this instruction if it's not supported */
>>>   if (!wait_supported)
>>> @@ -60,11 +89,24 @@ rte_power_monitor(const struct 
>>> rte_power_monitor_cond *pmc,
>>>   if (masked == pmc->val)
>>>   return;
>>>   }
>>> +
>>> +s = &wait_status[lcore_id];
>>> +
>>> +/* update sleep address */
>>> +rte_spinlock_lock(&s->lock);
>>> +s->monitor_addr = pmc->addr;
>>> +rte_spinlock_unlock(&s->lock);
>>
>> It was a while, since I looked at it last time,
>> but shouldn't we grab the lock before monitor()?
>> I.E:
>> lock();
>> monitor();
>> addr=...;
>> unlock();
>> umwait();
>>
> 
> I don't believe so.
> 
> The idea here is to only store the address when we are looking to sleep, 
> and avoid the locks entirely if we already know we aren't going to 
> sleep. I mean, technically we could lock unconditionally, then unlock 
> when we're done, but there's very little practical difference between 
> the two because the moment we are interested in (sleep) happens the same 
> way whether we lock before or after monitor().

On second thought, putting the lock before monitor() and unlocking
afterwards allows us to ask for a wakeup earlier, without necessarily
waiting for a sleep. So I think I'll take your suggestion on board anyway,
thanks!
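
For readers following the ordering discussed above, here is a minimal,
self-contained sketch of the "lock, monitor, store address, unlock, wait"
sequence and the matching wakeup path. It is not the patch code: it uses
C11 atomics instead of rte_spinlock, the names (struct wait_status,
ws_arm, ws_disarm, ws_wakeup) are hypothetical, and the places where
UMONITOR/UMWAIT would be issued are only marked with comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-lcore state, loosely modelled on the patch. */
struct wait_status {
	atomic_flag lock;
	_Atomic(uint64_t) *monitor_addr; /* NULL when not armed */
};

static void ws_lock(struct wait_status *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
			memory_order_acquire))
		; /* spin until the flag is released */
}

static void ws_unlock(struct wait_status *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Sleeper, before waiting: arm and publish the address under the lock. */
static void ws_arm(struct wait_status *s, _Atomic(uint64_t) *addr)
{
	assert(addr != NULL);
	ws_lock(s);
	/* UMONITOR on addr would be issued here, inside the lock */
	s->monitor_addr = addr;
	ws_unlock(s);
	/* UMWAIT would follow here, outside the lock */
}

/* Sleeper, after waking: erase the published address. */
static void ws_disarm(struct wait_status *s)
{
	ws_lock(s);
	s->monitor_addr = NULL;
	ws_unlock(s);
}

/* Waker: write the current value back so the monitored line sees a store.
 * Returns true if an address was armed and a write was attempted. */
static bool ws_wakeup(struct wait_status *s)
{
	bool armed = false;

	ws_lock(s);
	if (s->monitor_addr != NULL) {
		uint64_t val = atomic_load_explicit(s->monitor_addr,
				memory_order_relaxed);
		/* compare-exchange with the same value: a write, no change */
		atomic_compare_exchange_strong_explicit(s->monitor_addr,
				&val, val, memory_order_relaxed,
				memory_order_relaxed);
		armed = true;
	}
	ws_unlock(s);
	return armed;
}
```

Because the address is published inside the same critical section that
arms the monitor, a waker that takes the lock afterwards will always find
the address, even if the sleeper has not reached the wait yet.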

> 
>>> +
>>>   /* execute UMWAIT */
>>>   asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
>>>   : /* ignore rflags */
>>>   : "D"(0), /* enter C0.2 */
>>>     "a"(tsc_l), "d"(tsc_h));
>>> +
>>> +/* erase sleep address */
>>> +rte_spinlock_lock(&s->lock);
>>> +s->monitor_addr = NULL;
>>> +rte_spinlock_unlock(&s->lock);
>>>   }
>>>
>>>   /**
>>> @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) {
>>>   if (i.power_monitor && i.power_pause)
>>>   wait_supported = 1;
>>>   }
>>> +
>>> +void
>>> +rte_power_monitor_wakeup(const unsigned int lcore_id)
>>> +{
>>> +struct power_wait_status *s;
>>> +
>>> +/* prevent buffer overrun */
>>> +if (lcore_id >= RTE_MAX_LCORE)
>>> +return;
>>> +
>>> +/* prevent user from running this instruction if it's not supported */
>>> +if (!wait_supported)
>>> +return;
>>> +
>>> +s = &wait_status[lcore_id];
>>> +
>>> +/*
>>> + * There is a race condition between sleep, wakeup and locking, but we
>>> + * don't need to handle it.
>>> + *
>>> + * Possible situations:
>>> + *
>>> + * 1. T1 locks, sets address, unlocks
>>> + * 2. T2 locks, triggers wakeup, unlocks
>>> + * 3. T1 sleeps
>>> + *
>>> + * In this case, because T1 has already set the address for monitoring,
>>> + * we will wake up immediately even if T2 triggers wakeup before T1
>>> + * goes to sleep.
>>> + *
>>> + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
>>> + * 2. T2 locks, triggers wakeup, and unlocks
>>> + * 3. T1 locks, erases address, and unlocks
>>> + *
>>> + * In this case, since we've already woken up, the "wakeup" was
>>> + * unneeded, and since T1 is still waiting on T2 releasing the lock, 
>>> the
>>> + * wakeup address is still valid so it's perfectly safe to write it.
>>> + */
>>> +rte_spinlock_lock(&s->lock);
>>> +if (s->monitor_addr != NULL)
>>> +__umwait_wakeup(s->monitor_addr);
>>> +rte_spinlock_unlock(&s->lock);
>>> +}
>>> -- 
>>> 2.25.1
> 
> 


-- 
Thanks,
Anatoly


* [dpdk-dev] [PATCH v16 00/11] Add PMD power management
  2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
                               ` (10 preceding siblings ...)
  2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-12 17:37             ` Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics Anatoly Burakov
                                 ` (12 more replies)
  11 siblings, 13 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt,
	bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the  
CPU to enter a power-optimized state while waiting for packets to  
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver that will allow us to know the address of the wake-up
event, and wait for writes on that address.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They 
are used in their raw opcode form because there is no widespread 
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement 
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're 
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
a Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.
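
As a rough illustration of the empty-poll counting described above, a
minimal sketch might look like the following. This is not the code from
the patchset: the threshold value, the struct and the function name are
all assumptions made for the example, and the actual power management
action is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* hypothetical threshold, not from the patch */

/* Hypothetical per-queue bookkeeping. */
struct poll_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Called after every Rx burst with the number of packets received.
 * Returns true when the caller should apply its power management scheme
 * (monitor, pause or frequency scaling), false otherwise.
 */
static bool
poll_state_update(struct poll_state *st, uint16_t nb_rx)
{
	assert(st != NULL);

	if (nb_rx != 0) {
		/* traffic is back: reset the counter and stay awake */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls > EMPTYPOLL_MAX;
}
```

An Rx callback would call poll_state_update() once per burst and only
enter the power-optimized state when it returns true, so a single packet
is enough to bring the queue back to normal polling.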

This patchset also introduces a few changes into existing power 
management-related intrinsics, namely to provide a native way of waking 
up a sleeping core without the application being responsible for it, as well
as general robustness improvements. There's quite a bit of locking going 
on, but these locks are per-thread and very little (if any) contention 
is expected, so the performance impact shouldn't be that bad (and in any 
case the locking happens when we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the 
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows users
to just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle RX on at most a single queue
- Three policy types are supported: monitor, pause, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v16:
- Implemented Konstantin's suggestions and comments
- Added return values to the API

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  44 +++
 doc/guides/rel_notes/release_21_02.rst        |  15 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |  10 +-
 drivers/event/dlb2/dlb2.c                     |  10 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  25 ++
 drivers/net/i40e/i40e_rxtx.h                  |   1 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   1 +
 examples/l3fwd-power/main.c                   |  89 ++++-
 .../arm/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  38 ++
 .../include/generic/rte_power_intrinsics.h    |  88 ++---
 .../ppc/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  38 ++
 lib/librte_eal/version.map                    |   3 +
 .../x86/include/rte_power_intrinsics.h        | 115 ------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 220 +++++++++++
 lib/librte_ethdev/rte_ethdev.c                |  28 ++
 lib/librte_ethdev/rte_ethdev.h                |  25 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  22 ++
 lib/librte_ethdev/version.map                 |   3 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 359 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  90 +++++
 lib/librte_power/version.map                  |   5 +
 34 files changed, 1148 insertions(+), 259 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1


* [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 02/11] eal: avoid invalid API usage in " Anatoly Burakov
                                 ` (11 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, power intrinsics are inline functions. Make them part of the
ABI so that we can have various internal data associated with them
without exposing said data to the outside world.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v14:
    - Fix compile issues on ARM and PPC64 by moving implementations to .c files

 .../arm/include/rte_power_intrinsics.h        |  40 ------
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  45 +++++++
 .../include/generic/rte_power_intrinsics.h    |   6 +-
 .../ppc/include/rte_power_intrinsics.h        |  40 ------
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  45 +++++++
 lib/librte_eal/version.map                    |   3 +
 .../x86/include/rte_power_intrinsics.h        | 115 -----------------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 120 ++++++++++++++++++
 11 files changed, 219 insertions(+), 198 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index a4a1bc1159..9e498e9ebf 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -13,46 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	RTE_SET_USED(tsc_timestamp);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build
index d62875ebae..6ec53ea03a 100644
--- a/lib/librte_eal/arm/meson.build
+++ b/lib/librte_eal/arm/meson.build
@@ -7,4 +7,5 @@ sources += files(
 	'rte_cpuflags.c',
 	'rte_cycles.c',
 	'rte_hypervisor.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
new file mode 100644
index 0000000000..ab1f44f611
--- /dev/null
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index dd520d90fa..67977bd511 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -52,7 +52,7 @@
  *   to undefined result.
  */
 __rte_experimental
-static inline void rte_power_monitor(const volatile void *p,
+void rte_power_monitor(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz);
 
@@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p,
  *   wakes up.
  */
 __rte_experimental
-static inline void rte_power_monitor_sync(const volatile void *p,
+void rte_power_monitor_sync(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz,
 		rte_spinlock_t *lck);
@@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  *   architecture-dependent.
  */
 __rte_experimental
-static inline void rte_power_pause(const uint64_t tsc_timestamp);
+void rte_power_pause(const uint64_t tsc_timestamp);
 
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 4ed03d521f..c0e9ac279f 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -13,46 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	RTE_SET_USED(tsc_timestamp);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/ppc/meson.build b/lib/librte_eal/ppc/meson.build
index f4b6d95c42..43c46542fb 100644
--- a/lib/librte_eal/ppc/meson.build
+++ b/lib/librte_eal/ppc/meson.build
@@ -7,4 +7,5 @@ sources += files(
 	'rte_cpuflags.c',
 	'rte_cycles.c',
 	'rte_hypervisor.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
new file mode 100644
index 0000000000..84340ca2a4
--- /dev/null
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index b1db7ec795..32eceb8869 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -405,6 +405,9 @@ EXPERIMENTAL {
 	rte_vect_set_max_simd_bitwidth;
 
 	# added in 21.02
+	rte_power_monitor;
+	rte_power_monitor_sync;
+	rte_power_pause;
 	rte_thread_tls_key_create;
 	rte_thread_tls_key_delete;
 	rte_thread_tls_value_get;
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
index c7d790c854..e4c2b87f73 100644
--- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -13,121 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-static inline uint64_t
-__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz)
-{
-	switch (sz) {
-	case sizeof(uint8_t):
-		return *(const volatile uint8_t *)p;
-	case sizeof(uint16_t):
-		return *(const volatile uint16_t *)p;
-	case sizeof(uint32_t):
-		return *(const volatile uint32_t *)p;
-	case sizeof(uint64_t):
-		return *(const volatile uint64_t *)p;
-	default:
-		/* this is an intrinsic, so we can't have any error handling */
-		RTE_ASSERT(0);
-		return 0;
-	}
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
-/**
- * This function uses TPAUSE instruction  and will enter C0.2 state. For more
- * information about usage of this instruction, please refer to Intel(R) 64 and
- * IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* execute TPAUSE */
-	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-		: /* ignore rflags */
-		: "D"(0), /* enter C0.2 */
-		  "a"(tsc_l), "d"(tsc_h));
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build
index e78f29002e..dfd42dee0c 100644
--- a/lib/librte_eal/x86/meson.build
+++ b/lib/librte_eal/x86/meson.build
@@ -8,4 +8,5 @@ sources += files(
 	'rte_cycles.c',
 	'rte_hypervisor.c',
 	'rte_spinlock.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
new file mode 100644
index 0000000000..34c5fd9c3e
--- /dev/null
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses TPAUSE instruction  and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			"a"(tsc_l), "d"(tsc_h));
+}
-- 
2.25.1

* [dpdk-dev] [PATCH v16 02/11] eal: avoid invalid API usage in power intrinsics
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 03/11] eal: change API of " Anatoly Burakov
                                 ` (10 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel,
	david.hunt, chris.macnamara

Currently, the API documentation mandates that if the user wants to use
the power management intrinsics, they need to call the
`rte_cpu_get_intrinsics_support` API and check support for specific
intrinsics.

However, if the user does not do that, it is possible to get an illegal
instruction error, because we're using raw instruction opcodes, which may
or may not be supported at runtime.

Now that we have everything in a C file, we can check for support at
startup and prevent the user from possibly encountering illegal
instruction errors.

We also add return values to the APIs, so that callers can detect
unsupported or invalid usage.
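
As an illustration of the pattern described above, here is a minimal,
self-contained sketch. It is not the DPDK code: every `demo_*` name is
hypothetical, the feature probe is a stub that always reports "unsupported",
and `__attribute__((constructor))` (a GCC/Clang extension) merely stands in
for what `RTE_INIT` does in the patch. The wait instructions themselves are
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the CPU feature probe; always says "no". */
static bool
demo_cpu_has_wait(void)
{
	return false;
}

/* Set once at startup, checked on every call. */
static bool demo_wait_supported;

/* Runs before main(), playing the role RTE_INIT plays in the patch. */
__attribute__((constructor))
static void
demo_power_init(void)
{
	demo_wait_supported = demo_cpu_has_wait();
}

/* Accept only the sizes a single load can read: 1, 2, 4 or 8 bytes. */
static int
demo_check_val_size(const uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
	case sizeof(uint16_t):
	case sizeof(uint32_t):
	case sizeof(uint64_t):
		return 0;
	default:
		return -1;
	}
}

/* Hypothetical guarded entry point: fail cleanly instead of faulting. */
int
demo_power_monitor(const volatile void *p, const uint8_t data_sz)
{
	if (!demo_wait_supported)
		return -ENOTSUP;
	if (demo_check_val_size(data_sz) < 0)
		return -EINVAL;

	(void)p; /* the real code would arm the monitor and wait here */
	return 0;
}
```

With a guard like this, a caller that skips the support check gets
`-ENOTSUP` back rather than an illegal instruction fault.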

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v16:
    - Add return values and proper error handling to the API
    
    v15:
    - Remove accidental whitespace changes
    
    v14:
    - Replace uint8_t with bool
    

 lib/librte_eal/arm/rte_power_intrinsics.c     | 12 +++-
 .../include/generic/rte_power_intrinsics.h    | 24 +++++--
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 12 +++-
 lib/librte_eal/x86/rte_power_intrinsics.c     | 64 +++++++++++++++++--
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index ab1f44f611..7e7552fa8a 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -7,7 +7,7 @@
 /**
  * This function is not supported on ARM.
  */
-void
+int
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
@@ -17,12 +17,14 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	RTE_SET_USED(value_mask);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(data_sz);
+
+	return -ENOTSUP;
 }
 
 /**
  * This function is not supported on ARM.
  */
-void
+int
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
@@ -33,13 +35,17 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
 	RTE_SET_USED(data_sz);
+
+	return -ENOTSUP;
 }
 
 /**
  * This function is not supported on ARM.
  */
-void
+int
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
 }
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 67977bd511..37e4ec0414 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -34,7 +34,6 @@
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -50,9 +49,14 @@
  *   Data size (in bytes) that will be used to compare expected value with the
  *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
  *   to undefined result.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
  */
 __rte_experimental
-void rte_power_monitor(const volatile void *p,
+int rte_power_monitor(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz);
 
@@ -75,7 +79,6 @@ void rte_power_monitor(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -95,9 +98,14 @@ void rte_power_monitor(const volatile void *p,
  *   A spinlock that must be locked before entering the function, will be
  *   unlocked while the CPU is sleeping, and will be locked again once the CPU
  *   wakes up.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
  */
 __rte_experimental
-void rte_power_monitor_sync(const volatile void *p,
+int rte_power_monitor_sync(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz,
 		rte_spinlock_t *lck);
@@ -111,13 +119,17 @@ void rte_power_monitor_sync(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
  */
 __rte_experimental
-void rte_power_pause(const uint64_t tsc_timestamp);
+int rte_power_pause(const uint64_t tsc_timestamp);
 
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 84340ca2a4..929e0611b0 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -7,7 +7,7 @@
 /**
  * This function is not supported on PPC64.
  */
-void
+int
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
@@ -17,12 +17,14 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	RTE_SET_USED(value_mask);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(data_sz);
+
+	return -ENOTSUP;
 }
 
 /**
  * This function is not supported on PPC64.
  */
-void
+int
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
@@ -33,13 +35,17 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
 	RTE_SET_USED(data_sz);
+
+	return -ENOTSUP;
 }
 
 /**
  * This function is not supported on PPC64.
  */
-void
+int
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
 }
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 34c5fd9c3e..2a38440bec 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,8 @@
 
 #include "rte_power_intrinsics.h"
 
+static bool wait_supported;
+
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
 {
@@ -17,24 +19,47 @@ __get_umwait_val(const volatile void *p, const uint8_t sz)
 	case sizeof(uint64_t):
 		return *(const volatile uint64_t *)p;
 	default:
-		/* this is an intrinsic, so we can't have any error handling */
+		/* shouldn't happen */
 		RTE_ASSERT(0);
 		return 0;
 	}
 }
 
+static inline int
+__check_val_size(const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):  /* fall-through */
+	case sizeof(uint16_t): /* fall-through */
+	case sizeof(uint32_t): /* fall-through */
+	case sizeof(uint64_t): /* fall-through */
+		return 0;
+	default:
+		/* unexpected size */
+		return -1;
+	}
+}
+
 /**
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
  * For more information about usage of these instructions, please refer to
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
-void
+int
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
+	if (__check_val_size(data_sz) < 0)
+		return -EINVAL;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -51,13 +76,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 
 		/* if the masked value is already matching, abort */
 		if (masked == expected_value)
-			return;
+			return 0;
 	}
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
+
+	return 0;
 }
 
 /**
@@ -65,13 +92,21 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * For more information about usage of these instructions, please refer to
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
-void
+int
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
+	if (__check_val_size(data_sz) < 0)
+		return -EINVAL;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -88,7 +123,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 
 		/* if the masked value is already matching, abort */
 		if (masked == expected_value)
-			return;
+			return 0;
 	}
 	rte_spinlock_unlock(lck);
 
@@ -99,6 +134,8 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 			  "a"(tsc_l), "d"(tsc_h));
 
 	rte_spinlock_lock(lck);
+
+	return 0;
 }
 
 /**
@@ -106,15 +143,30 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
  * information about usage of this instruction, please refer to Intel(R) 64 and
  * IA-32 Architectures Software Developer's Manual.
  */
-void
+int
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
 	/* execute TPAUSE */
 	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			"a"(tsc_l), "d"(tsc_h));
+
+	return 0;
+}
+
+RTE_INIT(rte_power_intrinsics_init) {
+	struct rte_cpu_intrinsics i;
+
+	rte_cpu_get_intrinsics_support(&i);
+
+	if (i.power_monitor && i.power_pause)
+		wait_supported = 1;
 }
-- 
2.25.1

* [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 02/11] eal: avoid invalid API usage in " Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-13 13:01                 ` Ananyev, Konstantin
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 04/11] eal: remove sync version of power monitor Anatoly Burakov
                                 ` (9 subsequent siblings)
  12 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Timothy McDaniel, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Bruce Richardson, Konstantin Ananyev, thomas,
	david.hunt, chris.macnamara

Instead of passing around pointers and integers, collect everything
into a struct. This makes API design around these intrinsics much easier.
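
To make the idea concrete, here is a small standalone sketch of a
struct-based monitoring condition and the "value already matches" check
performed before sleeping. The `demo_*` names are hypothetical; the struct
only mirrors the fields of `struct rte_power_monitor_cond` from this patch.
Unlike the real function, which returns 0 in both cases, the helper returns
1 or 0 so the outcome of the check is visible.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the condition struct introduced by the patch. */
struct demo_monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* mask applied to the value read from addr */
	uint8_t data_sz;     /* read width in bytes: 1, 2, 4 or 8 */
};

/* Read data_sz bytes from p; the caller must have validated data_sz. */
static uint64_t
demo_read(const volatile void *p, const uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		assert(0); /* shouldn't happen */
		return 0;
	}
}

/*
 * Returns 1 if the wait would be skipped (masked value already matches),
 * 0 if the wait would go ahead, -EINVAL on bad input.
 */
int
demo_should_skip_wait(const struct demo_monitor_cond *pmc)
{
	if (pmc == NULL)
		return -EINVAL;

	switch (pmc->data_sz) {
	case 1: case 2: case 4: case 8:
		break;
	default:
		return -EINVAL;
	}

	/* a zero mask disables the comparison entirely */
	if (pmc->mask == 0)
		return 0;

	return (demo_read(pmc->addr, pmc->data_sz) & pmc->mask) == pmc->val;
}
```

Grouping the address, value, mask and size this way means a driver can hand
back one object describing what to watch, instead of four loose arguments.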

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v16:
    - Add error handling

 drivers/event/dlb/dlb.c                       | 10 ++--
 drivers/event/dlb2/dlb2.c                     | 10 ++--
 lib/librte_eal/arm/rte_power_intrinsics.c     | 20 +++-----
 .../include/generic/rte_power_intrinsics.h    | 50 ++++++++-----------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 20 +++-----
 lib/librte_eal/x86/rte_power_intrinsics.c     | 42 +++++++++-------
 6 files changed, 70 insertions(+), 82 deletions(-)

diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c
index 0c95c4793d..d2f2026291 100644
--- a/drivers/event/dlb/dlb.c
+++ b/drivers/event/dlb/dlb.c
@@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		/* Interrupts not supported by PF PMD */
 		return 1;
 	} else if (dlb->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 86724863f2..c9a8a02278 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 	if (elapsed_ticks >= timeout) {
 		return 1;
 	} else if (dlb2->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb2_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 7e7552fa8a..5f1caaf25b 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -8,15 +8,11 @@
  * This function is not supported on ARM.
  */
 int
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 
 	return -ENOTSUP;
 }
@@ -25,16 +21,12 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * This function is not supported on ARM.
  */
 int
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 
 	return -ENOTSUP;
 }
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 37e4ec0414..3ad53068d5 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -18,6 +18,18 @@
  * which are architecture-dependent.
  */
 
+struct rte_power_monitor_cond {
+	volatile void *addr;  /**< Address to monitor for changes */
+	uint64_t val;         /**< Before attempting the monitoring, the address
+	                       *   may be read and compared against this value.
+	                       **/
+	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
+	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
+	                  *   expected value with the memory address. Can be 1,
+	                  *   2, 4, or 8. Supplying any other value will lead to
+	                  *   undefined result. */
+};
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
@@ -35,20 +47,11 @@
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  *
  * @return
  *   0 on success
@@ -56,10 +59,8 @@
  *   -ENOTSUP if unsupported
  */
 __rte_experimental
-int rte_power_monitor(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz);
-
+int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp);
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
@@ -80,20 +81,11 @@ int rte_power_monitor(const volatile void *p,
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  * @param lck
  *   A spinlock that must be locked before entering the function, will be
  *   unlocked while the CPU is sleeping, and will be locked again once the CPU
@@ -105,10 +97,8 @@ int rte_power_monitor(const volatile void *p,
  *   -ENOTSUP if unsupported
  */
 __rte_experimental
-int rte_power_monitor_sync(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz,
-		rte_spinlock_t *lck);
+int rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
 
 /**
  * @warning
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 929e0611b0..5e5a1fff5a 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -8,15 +8,11 @@
  * This function is not supported on PPC64.
  */
 int
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 
 	return -ENOTSUP;
 }
@@ -25,16 +21,12 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * This function is not supported on PPC64.
  */
 int
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 
 	return -ENOTSUP;
 }
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 2a38440bec..6be5c8b9f1 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -46,9 +46,8 @@ __check_val_size(const uint8_t sz)
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 int
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -57,7 +56,10 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	if (!wait_supported)
 		return -ENOTSUP;
 
-	if (__check_val_size(data_sz) < 0)
+	if (pmc == NULL)
+		return -EINVAL;
+
+	if (__check_val_size(pmc->data_sz) < 0)
 		return -EINVAL;
 
 	/*
@@ -68,14 +70,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return 0;
 	}
 	/* execute UMWAIT */
@@ -93,9 +96,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 int
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -104,7 +106,10 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	if (!wait_supported)
 		return -ENOTSUP;
 
-	if (__check_val_size(data_sz) < 0)
+	if (pmc == NULL || lck == NULL)
+		return -EINVAL;
+
+	if (__check_val_size(pmc->data_sz) < 0)
 		return -EINVAL;
 
 	/*
@@ -115,14 +120,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return 0;
 	}
 	rte_spinlock_unlock(lck);
-- 
2.25.1

* [dpdk-dev] [PATCH v16 04/11] eal: remove sync version of power monitor
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (2 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function Anatoly Burakov
                                 ` (8 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the "sync" version of power monitor intrinsic is supposed to
be used for purposes of waking up a sleeping core. However, there are
better ways to achieve the same result, so remove the unneeded function.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_eal/arm/rte_power_intrinsics.c     | 14 -----
 .../include/generic/rte_power_intrinsics.h    | 38 -------------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 14 -----
 lib/librte_eal/version.map                    |  1 -
 lib/librte_eal/x86/rte_power_intrinsics.c     | 54 -------------------
 5 files changed, 121 deletions(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 5f1caaf25b..8d271dc0c1 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -17,20 +17,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	return -ENOTSUP;
 }
 
-/**
- * This function is not supported on ARM.
- */
-int
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-
-	return -ENOTSUP;
-}
-
 /**
  * This function is not supported on ARM.
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 3ad53068d5..85343bc9eb 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -61,44 +61,6 @@ struct rte_power_monitor_cond {
 __rte_experimental
 int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
-/**
- * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
- *
- * Monitor specific address for changes. This will cause the CPU to enter an
- * architecture-defined optimized power state until either the specified
- * memory address is written to, a certain TSC timestamp is reached, or other
- * reasons cause the CPU to wake up.
- *
- * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
- * mask is non-zero, the current value pointed to by the `p` pointer will be
- * checked against the expected value, and if they match, the entering of
- * optimized power state may be aborted.
- *
- * This call will also lock a spinlock on entering sleep, and release it on
- * waking up the CPU.
- *
- * @warning It is responsibility of the user to check if this function is
- *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *
- * @param pmc
- *   The monitoring condition structure.
- * @param tsc_timestamp
- *   Maximum TSC timestamp to wait for. Note that the wait behavior is
- *   architecture-dependent.
- * @param lck
- *   A spinlock that must be locked before entering the function, will be
- *   unlocked while the CPU is sleeping, and will be locked again once the CPU
- *   wakes up.
- *
- * @return
- *   0 on success
- *   -EINVAL on invalid parameters
- *   -ENOTSUP if unsupported
- */
-__rte_experimental
-int rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
 
 /**
  * @warning
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 5e5a1fff5a..f7862ea324 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -17,20 +17,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	return -ENOTSUP;
 }
 
-/**
- * This function is not supported on PPC64.
- */
-int
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-
-	return -ENOTSUP;
-}
-
 /**
  * This function is not supported on PPC64.
  */
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 32eceb8869..1fcd1d3bed 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,7 +406,6 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
-	rte_power_monitor_sync;
 	rte_power_pause;
 	rte_thread_tls_key_create;
 	rte_thread_tls_key_delete;
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 6be5c8b9f1..29247d8638 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -90,60 +90,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	return 0;
 }
 
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-int
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* prevent user from running this instruction if it's not supported */
-	if (!wait_supported)
-		return -ENOTSUP;
-
-	if (pmc == NULL || lck == NULL)
-		return -EINVAL;
-
-	if (__check_val_size(pmc->data_sz) < 0)
-		return -EINVAL;
-
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(pmc->addr));
-
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->data_sz);
-		const uint64_t masked = cur_value & pmc->mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			return 0;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-
-	return 0;
-}
-
 /**
  * This function uses TPAUSE instruction  and will enter C0.2 state. For more
  * information about usage of this instruction, please refer to Intel(R) 64 and
-- 
2.25.1

* [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (3 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 04/11] eal: remove sync version of power monitor Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-13 12:46                 ` Ananyev, Konstantin
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API Anatoly Burakov
                                 ` (7 subsequent siblings)
  12 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Now that we have everything in a C file, we can store the information
about our sleep, and have a native mechanism to wake up the sleeping
core. This mechanism would however only wake up a core that's sleeping
while monitoring - waking up from `rte_power_pause` won't work.
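
To illustrate the wakeup idea in isolation, here is a minimal standalone
sketch. It is not the EAL code: the `sketch_*` names are hypothetical, the
per-lcore locking is left out, and it only shows a write being triggered on
the monitored location while its value stays the same.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-lcore sleep status. */
struct sketch_wait_status {
	volatile uint64_t *monitor_addr; /* NULL if not currently sleeping */
};

/* Touch the monitored location without changing what it holds. */
static void
sketch_touch(volatile uint64_t *addr)
{
	uint64_t val = __atomic_load_n(addr, __ATOMIC_RELAXED);

	/* compare-and-swap the value with itself: content stays the same */
	__atomic_compare_exchange_n(addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/* Returns 1 if a wakeup write was attempted, 0 if nobody was sleeping. */
static int
sketch_wakeup(struct sketch_wait_status *s)
{
	if (s->monitor_addr == NULL)
		return 0;

	sketch_touch(s->monitor_addr);
	return 1;
}
```

A caller would set `monitor_addr` before sleeping and clear it after waking
up, which is why a real implementation needs a lock around both accesses.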

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v16:
    - Improve error handling
    - Take a lock before UMONITOR
    
    v13:
    - Add comments around wakeup code to explain what it does
    - Add lcore_id parameter checking to prevent buffer overrun

 lib/librte_eal/arm/rte_power_intrinsics.c     |  9 ++
 .../include/generic/rte_power_intrinsics.h    | 16 +++
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  9 ++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 98 ++++++++++++++++++-
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 8d271dc0c1..5a24c13913 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -27,3 +27,12 @@ rte_power_pause(const uint64_t tsc_timestamp)
 
 	return -ENOTSUP;
 }
+
+/**
+ * This function is not supported on ARM.
+ */
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 85343bc9eb..6109d28faa 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -62,6 +62,22 @@ __rte_experimental
 int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Wake up a specific lcore that is in a power optimized state and is monitoring
+ * an address.
+ *
+ * @note This function will *not* wake up a core that is in a power optimized
+ *   state due to calling `rte_power_pause`.
+ *
+ * @param lcore_id
+ *   Lcore ID of a sleeping thread.
+ */
+__rte_experimental
+int rte_power_monitor_wakeup(const unsigned int lcore_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index f7862ea324..7e334f7cf0 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -27,3 +27,12 @@ rte_power_pause(const uint64_t tsc_timestamp)
 
 	return -ENOTSUP;
 }
+
+/**
+ * This function is not supported on PPC64.
+ */
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 1fcd1d3bed..fce90a112f 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,6 +406,7 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
+	rte_power_monitor_wakeup;
 	rte_power_pause;
 	rte_thread_tls_key_create;
 	rte_thread_tls_key_delete;
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 29247d8638..a9e1689f75 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -2,8 +2,31 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
+#include <rte_common.h>
+#include <rte_lcore.h>
+#include <rte_spinlock.h>
+
 #include "rte_power_intrinsics.h"
 
+/*
+ * Per-lcore structure holding current status of C0.2 sleeps.
+ */
+static struct power_wait_status {
+	rte_spinlock_t lock;
+	volatile void *monitor_addr; /**< NULL if not currently sleeping */
+} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+
+static inline void
+__umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	/* trigger a write but don't change the value */
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
 static bool wait_supported;
 
 static inline uint64_t
@@ -51,17 +74,29 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
 		return -ENOTSUP;
 
+	/* prevent non-EAL thread from using this API */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
 	if (pmc == NULL)
 		return -EINVAL;
 
 	if (__check_val_size(pmc->data_sz) < 0)
 		return -EINVAL;
 
+	s = &wait_status[lcore_id];
+
+	/* update sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = pmc->addr;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -72,21 +107,37 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 			:
 			: "D"(pmc->addr));
 
+	/* now that we've put this address into monitor, we can unlock */
+	rte_spinlock_unlock(&s->lock);
+
+	/* if we have a comparison mask, we might not need to sleep at all */
 	if (pmc->mask) {
 		const uint64_t cur_value = __get_umwait_val(
 				pmc->addr, pmc->data_sz);
 		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
+		if (masked == pmc->val) {
+			/* erase sleep address */
+			rte_spinlock_lock(&s->lock);
+			s->monitor_addr = NULL;
+			rte_spinlock_unlock(&s->lock);
+
 			return 0;
+		}
 	}
+
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
 
+	/* erase sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = NULL;
+	rte_spinlock_unlock(&s->lock);
+
 	return 0;
 }
 
@@ -122,3 +173,48 @@ RTE_INIT(rte_power_intrinsics_init) {
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
 }
+
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	struct power_wait_status *s;
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
+	/* prevent buffer overrun */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	s = &wait_status[lcore_id];
+
+	/*
+	 * There is a race condition between sleep, wakeup and locking, but we
+	 * don't need to handle it.
+	 *
+	 * Possible situations:
+	 *
+	 * 1. T1 locks, sets address, unlocks
+	 * 2. T2 locks, triggers wakeup, unlocks
+	 * 3. T1 sleeps
+	 *
+	 * In this case, because T1 has already set the address for monitoring,
+	 * we will wake up immediately even if T2 triggers wakeup before T1
+	 * goes to sleep.
+	 *
+	 * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
+	 * 2. T2 locks, triggers wakeup, and unlocks
+	 * 3. T1 locks, erases address, and unlocks
+	 *
+	 * In this case, since we've already woken up, the "wakeup" was
+	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
+	 * wakeup address is still valid so it's perfectly safe to write it.
+	 */
+	rte_spinlock_lock(&s->lock);
+	if (s->monitor_addr != NULL)
+		__umwait_wakeup(s->monitor_addr);
+	rte_spinlock_unlock(&s->lock);
+
+	return 0;
+}
-- 
2.25.1

* [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (4 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-13 13:18                 ` Ananyev, Konstantin
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback Anatoly Burakov
                                 ` (6 subsequent siblings)
  12 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, konstantin.ananyev, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to allow getting the monitor conditions for
power-optimized monitoring of the Rx queues from the PMD, as well as
release notes information.
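
For readers unfamiliar with how a monitor condition is consumed, here is a
minimal standalone sketch of the mask/value check done before sleeping. The
`sketch_*` names are hypothetical; only the field names (addr, val, mask,
data_sz) follow the structure used in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the condition a driver fills in. */
struct sketch_monitor_cond {
	volatile void *addr; /* location the NIC is expected to write to */
	uint64_t val;        /* masked value meaning "already written" */
	uint64_t mask;       /* bits to compare; 0 disables the check */
	uint8_t data_sz;     /* width of the location in bytes: 1, 2, 4 or 8 */
};

/* Read the monitored location using the width given by the driver. */
static uint64_t
sketch_read(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case 1: return *(const volatile uint8_t *)p;
	case 2: return *(const volatile uint16_t *)p;
	case 4: return *(const volatile uint32_t *)p;
	case 8: return *(const volatile uint64_t *)p;
	default: return 0; /* unsupported width, treated as "no match" */
	}
}

/* Returns true when the sleep should be skipped. */
static bool
sketch_should_abort_sleep(const struct sketch_monitor_cond *c)
{
	if (c->mask == 0)
		return false;

	return (sketch_read(c->addr, c->data_sz) & c->mask) == c->val;
}
```

The check matters because a descriptor may be written between the last empty
poll and the moment the core arms the monitor.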

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---

Notes:
    v13:
    - Fix typos and issues raised by Andrew

 doc/guides/rel_notes/release_21_02.rst |  5 +++++
 lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
 lib/librte_ethdev/version.map          |  3 +++
 5 files changed, 83 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 706cbf8f0c..ec9958a141 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: added new API for PMD power management**
+
+  * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
+    ``rte_power_monitor()`` to enable automatic power management for PMD's.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 17ddacc78d..e19dbd838b 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (pmc == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n",
+				pmc);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id],
+			pmc));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index f5f8919186..ca0f91312e 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -4334,6 +4335,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Retrieve the monitor condition for a given receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param pmc
+ *   Pointer to the power-optimized monitoring condition structure.
+ *
+ * @return
+ *   - 0: Success.
+ *   - -ENOTSUP: Operation not supported.
+ *   - -EINVAL: Invalid parameters.
+ *   - -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index 0eacfd8425..3b3b0ec1a0 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an Rx queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param pmc
+ *   The pointer to power-optimized monitoring condition structure.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_monitor_addr_t)(void *rxq,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -917,6 +937,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_monitor_addr_t get_monitor_addr;
+	/**< Get power monitoring condition for Rx queue. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index d3f5410806..a124e1e370 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -240,6 +240,9 @@ EXPERIMENTAL {
 	rte_flow_get_restore_info;
 	rte_flow_tunnel_action_decap_release;
 	rte_flow_tunnel_item_release;
+
+	# added in 21.02
+	rte_eth_get_monitor_addr;
 };
 
 INTERNAL {
-- 
2.25.1

* [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (5 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-13 12:58                 ` Ananyev, Konstantin
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 08/11] net/ixgbe: implement power management API Anatoly Burakov
                                 ` (5 subsequent siblings)
  12 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas,
	konstantin.ananyev, timothy.mcdaniel, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either
a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling
   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.
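
All three schemes share the same empty-poll trigger. A minimal standalone
sketch of that trigger, with hypothetical `sketch_*` names and the threshold
set to the same value as EMPTYPOLL_MAX in this patch, could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as EMPTYPOLL_MAX in this patch; the name is hypothetical. */
#define SKETCH_EMPTYPOLL_MAX 512

/* Hypothetical per-queue state: consecutive polls that returned nothing. */
struct sketch_queue_state {
	uint64_t empty_polls;
};

/*
 * Called after every Rx burst with the number of packets received.
 * Returns true when the caller should enter a power-saving state.
 */
static bool
sketch_rx_poll_done(struct sketch_queue_state *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		q->empty_polls = 0;
		return false;
	}

	q->empty_polls++;
	return q->empty_polls > SKETCH_EMPTYPOLL_MAX;
}
```

Once the threshold is crossed, the function keeps returning true on every
further empty poll until a non-empty burst resets the counter.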

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v15:
    - Fix check in UMWAIT callback
    
    v13:
    - Rework the synchronization mechanism to not require locking
    - Add more parameter checking
    - Rework n_rx_queues access to not go through internal PMD structures and use
      public API instead

 doc/guides/prog_guide/power_man.rst    |  44 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 359 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 511 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..02280dd689 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+  * Monitor
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not available, use ``rte_pause()``.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing ``librte_power`` library
+   functionality to scale the core frequency up/down depending on traffic
+   volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core
+
 References
 ----------
 
@@ -200,3 +241,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index ec9958a141..9cd8214e2d 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -60,6 +60,16 @@ New Features
   * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
     ``rte_power_monitor()`` to enable automatic power management for PMD's.
 
+* **Add PMD power management helper API**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..51a471b669 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..470c3a912b
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,359 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+static struct pmd_conf_data {
+	struct rte_cpu_intrinsics intrinsics_support;
+	/**< what do we support? */
+	uint64_t tsc_per_us;
+	/**< pre-calculated tsc diff for 1us */
+	uint64_t pause_per_us;
+	/**< how many rte_pause can we fit in a microsecond? */
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+	/** Device power management status is about to change. */
+	PMD_MGMT_BUSY
+};
+
+struct pmd_queue_cfg {
+	volatile enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	volatile bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us */
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			int ret;
+
+			/*
+			 * we might get a cancellation request while being
+			 * inside the callback, in which case the wakeup
+			 * wouldn't work because it would've arrived too early.
+			 *
+			 * to get around this, we notify the other thread that
+			 * we're sleeping, so that it can spin until we're done.
+			 * unsolicited wakeups are perfectly safe.
+			 */
+			q_conf->umwait_in_progress = true;
+
+			/* check if we need to cancel sleep */
+			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+				/* use monitoring condition to sleep */
+				ret = rte_eth_get_monitor_addr(port_id, qidx,
+						&pmc);
+				if (ret == 0)
+					rte_power_monitor(&pmc, -1ULL);
+			}
+			q_conf->umwait_in_progress = false;
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
+		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	struct rte_eth_dev_info info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_eth_dev_info_get(port_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if queue id is valid */
+	if (queue_id >= info.nb_rx_queues) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we're about to change our state */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		struct rte_power_monitor_cond dummy;
+
+		/* check if rte_power_monitor is supported */
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->umwait_in_progress = false;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto rollback;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto rollback;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+	return ret;
+
+rollback:
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* let the callback know we're shutting down */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..0bfbc6ba69
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore on which the RX queue is polled.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management scheme to use (see enum rte_power_pmd_mgmt_type).
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore on which the RX queue is polled.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..61996b4d11 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v16 08/11] net/ixgbe: implement power management API
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (6 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 09/11] net/i40e: " Anatoly Burakov
                                 ` (4 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add support for the power management API by implementing a
`get_monitor_addr` function that returns the address of the status field
of the next RX ring descriptor, along with the value and mask to watch.
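
As a rough illustration of how a monitor condition like the one filled
in by this function can be checked before going to sleep, here is a
minimal standalone sketch. The struct and helper names (monitor_cond,
read_sized, cond_already_met) are hypothetical stand-ins, not the EAL
implementation, and a little-endian host is assumed so that no byte
swapping is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring the fields the driver fills in. */
struct monitor_cond {
	volatile void *addr; /* address to watch for writes */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* bits of interest */
	uint8_t data_sz;     /* width of the watched field, in bytes */
};

/* Read data_sz bytes from addr, zero-extended to 64 bits. */
static uint64_t
read_sized(const volatile void *addr, uint8_t data_sz)
{
	switch (data_sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)addr;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)addr;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)addr;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)addr;
	default:
		return 0; /* unsupported width: treat as "nothing set" */
	}
}

/*
 * Return 1 if the masked value already matches, i.e. the descriptor has
 * already been written back and sleeping would be pointless; 0 otherwise.
 */
static int
cond_already_met(const struct monitor_cond *pmc)
{
	uint64_t cur;

	assert(pmc != NULL && pmc->addr != NULL);
	cur = read_sized(pmc->addr, pmc->data_sz);
	return (cur & pmc->mask) == pmc->val;
}
```

With val and mask both set to the DD bit, as in this patch, the check
is true only once the NIC has written the descriptor back.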

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index d7a1806ab8..97acf35d24 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_monitor_addr     = ixgbe_get_monitor_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 7bb8460359..cc8f70e6dd 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int
+ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the status_error field is 32-bit */
+	pmc->data_sz = sizeof(uint32_t);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..8a25e98df6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v16 09/11] net/i40e: implement power management API
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (7 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 08/11] net/ixgbe: implement power management API Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 10/11] net/ice: " Anatoly Burakov
                                 ` (3 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add support for the power management API by implementing a
`get_monitor_addr` function that returns the address of the status field
of the next RX ring descriptor, along with the value and mask to watch.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 14622484a0..ba1abc584f 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_monitor_addr             = i40e_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..0b4220fc9c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,31 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* the status_error_len field is 64-bit */
+	pmc->data_sz = sizeof(uint64_t);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..e1494525ce 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v16 10/11] net/ice: implement power management API
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (8 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 09/11] net/i40e: " Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
                                 ` (2 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add support for the power management API by implementing a
`get_monitor_addr` function that returns the address of the status field
of the next RX ring descriptor, along with the value and mask to watch.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  1 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 587f485ee3..38c6263946 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_monitor_addr             = ice_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d052bd0f1b..066651dc48 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int
+ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* the status_error0 field is 16-bit */
+	pmc->data_sz = sizeof(uint16_t);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6b16716063..906fbefdc4 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts,
 				  uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v16 11/11] examples/l3fwd-power: enable PMD power mgmt
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (9 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 10/11] net/ice: " Anatoly Burakov
@ 2021-01-12 17:37               ` Anatoly Burakov
  2021-01-14  9:36               ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management David Marchand
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev,
	timothy.mcdaniel, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.
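
The new `--pmd-mgmt` option maps a mode name to one of the three power
management schemes. A minimal standalone sketch of such a mapping is
shown below; the enum and the parse_mode helper are local stand-ins
written for illustration only, not the code added by this patch. The
comparison is an exact match, so a name such as "scaled" is rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for enum rte_power_pmd_mgmt_type from this series. */
enum mgmt_type {
	MGMT_TYPE_MONITOR = 1,
	MGMT_TYPE_PAUSE,
	MGMT_TYPE_SCALE,
};

/* Map a mode name to a scheme; returns 0 on success, -1 if unknown. */
static int
parse_mode(const char *name, enum mgmt_type *out)
{
	static const struct {
		const char *name;
		enum mgmt_type type;
	} modes[] = {
		{ "monitor", MGMT_TYPE_MONITOR },
		{ "pause", MGMT_TYPE_PAUSE },
		{ "scale", MGMT_TYPE_SCALE },
	};
	size_t i;

	assert(out != NULL);
	if (name == NULL)
		return -1;

	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
		/* strcmp() compares up to and including the terminator */
		if (strcmp(modes[i].name, name) == 0) {
			*out = modes[i].type;
			return 0;
		}
	}
	/* unknown power management mode */
	return -1;
}
```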

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v12:
    - Allow selecting PMD power management scheme from command-line
    - Enforce 1 core 1 queue rule

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 89 ++++++++++++++++++-
 2 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt MODE: PMD power management mode (monitor, pause or scale).
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+In PMD power management mode, ``l3fwd-power`` does simple L3 forwarding and enables a
+power saving scheme on each configured port/queue/lcore. The main purpose of this mode
+is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt=monitor -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..e312b6f355 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2824,6 +2895,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API
  2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API Anatoly Burakov
  2020-12-28 11:00       ` Andrew Rybchenko
@ 2021-01-12 20:32       ` Lance Richardson
  2021-01-13 13:04         ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Lance Richardson @ 2021-01-12 20:32 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Ananyev, Konstantin, gage.eads,
	timothy.mcdaniel, david.hunt, Bruce Richardson, chris.macnamara

[-- Attachment #1: Type: text/plain, Size: 1298 bytes --]

On Thu, Dec 17, 2020 at 9:08 AM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add a simple API to allow getting the monitor conditions for
> power-optimized monitoring of the RX queues from the PMD, as well as
> release notes information.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
<snip>
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -917,6 +937,8 @@ struct eth_dev_ops {
>         /**< Set up the connection between the pair of hairpin queues. */
>         eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
>         /**< Disconnect the hairpin queues of a pair from each other. */
> +       eth_get_monitor_addr_t get_monitor_addr;
> +       /**< Get next RX queue ring entry address. */
>  };
>

The implementation of get_monitor_addr will have much in common with
the rx_descriptor_status API in struct rte_eth_dev, including the property
that it will likely not make sense for it to be called concurrently with
rx_pkt_burst on a given queue. Might it make more sense to have this
API in struct rte_eth_dev instead of struct eth_dev_ops?
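
To make the two placements concrete, here is a simplified sketch of the
call paths being compared. All types and names below are hypothetical
stand-ins, not the real ethdev definitions; the only point is that the
ops-table path needs one more pointer dereference than a slot held
directly in the device structure.

```c
#include <assert.h>
#include <stddef.h>

struct monitor_cond {
	volatile void *addr;
};

typedef int (*get_monitor_addr_t)(void *rxq, struct monitor_cond *pmc);

/* Control-path table, reached through one extra pointer. */
struct dev_ops {
	get_monitor_addr_t get_monitor_addr;
};

struct eth_dev {
	get_monitor_addr_t get_monitor_addr; /* hypothetical direct slot */
	const struct dev_ops *dev_ops;
	void *rxq;
};

/* Dummy driver callback used only to exercise both paths. */
static int
dummy_get_monitor_addr(void *rxq, struct monitor_cond *pmc)
{
	assert(pmc != NULL);
	pmc->addr = rxq;
	return 0;
}

/* Placement in the ops table: dev -> dev_ops -> function. */
static int
via_dev_ops(struct eth_dev *dev, struct monitor_cond *pmc)
{
	if (dev->dev_ops == NULL || dev->dev_ops->get_monitor_addr == NULL)
		return -1; /* not supported */
	return dev->dev_ops->get_monitor_addr(dev->rxq, pmc);
}

/* Placement in the device structure: dev -> function. */
static int
via_dev(struct eth_dev *dev, struct monitor_cond *pmc)
{
	if (dev->get_monitor_addr == NULL)
		return -1; /* not supported */
	return dev->get_monitor_addr(dev->rxq, pmc);
}
```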

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function Anatoly Burakov
@ 2021-01-13 12:46                 ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-13 12:46 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, McDaniel,
	Timothy, Hunt, David, Macnamara, Chris



> Now that we have everything in a C file, we can store the information
> about our sleep, and have a native mechanism to wake up the sleeping
> core. This mechanism would however only wake up a core that's sleeping
> while monitoring - waking up from `rte_power_pause` won't work.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v16:
>     - Improve error handling
>     - Take a lock before UMONITOR
> 
>     v13:
>     - Add comments around wakeup code to explain what it does
>     - Add lcore_id parameter checking to prevent buffer overrun
> 
>  lib/librte_eal/arm/rte_power_intrinsics.c     |  9 ++
>  .../include/generic/rte_power_intrinsics.h    | 16 +++
>  lib/librte_eal/ppc/rte_power_intrinsics.c     |  9 ++
>  lib/librte_eal/version.map                    |  1 +
>  lib/librte_eal/x86/rte_power_intrinsics.c     | 98 ++++++++++++++++++-
>  5 files changed, 132 insertions(+), 1 deletion(-)
> 
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-13 12:58                 ` Ananyev, Konstantin
  2021-01-13 17:29                   ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-13 12:58 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, thomas,
	McDaniel, Timothy, Richardson, Bruce, Macnamara, Chris



> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Tuesday, January 12, 2021 5:37 PM
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Hunt, David <david.hunt@intel.com>; Ray Kinsella <mdr@ashroe.eu>; Neil Horman
> <nhorman@tuxdriver.com>; thomas@monjalon.net; Ananyev, Konstantin <konstantin.ananyev@intel.com>; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Macnamara, Chris <chris.macnamara@intel.com>
> Subject: [PATCH v16 07/11] power: add PMD power management API and callback
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will wait until
> either a TSC timestamp expires or packets arrive.
> 
> This API mandates a core-to-single-queue mapping (that is, multiple
> queues per device are supported, but they have to be polled on different
> cores).
> 
> This design is using PMD RX callbacks.
> 
> 1. UMWAIT/UMONITOR:
> 
>    When a certain threshold of empty polls is reached, the core will go
>    into a power optimized sleep while waiting on an address of next RX
>    descriptor to be written to.
> 
> 2. TPAUSE/Pause instruction
> 
>    This method uses the pause (or TPAUSE, if available) instruction to
>    avoid busy polling.
> 
> 3. Frequency scaling
>    Reuse existing DPDK power library to scale up/down core frequency
>    depending on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v15:
>     - Fix check in UMWAIT callback
> 
>     v13:
>     - Rework the synchronization mechanism to not require locking
>     - Add more parameter checking
>     - Rework n_rx_queues access to not go through internal PMD structures and use
>       public API instead
> 
>  doc/guides/prog_guide/power_man.rst    |  44 +++
>  doc/guides/rel_notes/release_21_02.rst |  10 +
>  lib/librte_power/meson.build           |   5 +-
>  lib/librte_power/rte_power_pmd_mgmt.c  | 359 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
>  lib/librte_power/version.map           |   5 +
>  6 files changed, 511 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> 

...

> +
> +static uint16_t
> +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> +		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> +		void *addr __rte_unused)
> +{
> +
> +	struct pmd_queue_cfg *q_conf;
> +
> +	q_conf = &port_cfg[port_id][qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			struct rte_power_monitor_cond pmc;
> +			uint16_t ret;
> +
> +			/*
> +			 * we might get a cancellation request while being
> +			 * inside the callback, in which case the wakeup
> +			 * wouldn't work because it would've arrived too early.
> +			 *
> +			 * to get around this, we notify the other thread that
> +			 * we're sleeping, so that it can spin until we're done.
> +			 * unsolicited wakeups are perfectly safe.
> +			 */
> +			q_conf->umwait_in_progress = true;

This write and subsequent read can be reordered by the cpu.
I think you need rte_atomic_thread_fence(__ATOMIC_SEQ_CST) here and
in disable() code-path below.

> +
> +			/* check if we need to cancel sleep */
> +			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> +				/* use monitoring condition to sleep */
> +				ret = rte_eth_get_monitor_addr(port_id, qidx,
> +						&pmc);
> +				if (ret == 0)
> +					rte_power_monitor(&pmc, -1ULL);
> +			}
> +			q_conf->umwait_in_progress = false;
> +		}
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +

...

> +
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id)
> +{
> +	struct pmd_queue_cfg *queue_cfg;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +
> +	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
> +		return -EINVAL;
> +
> +	/* no need to check queue id as wrong queue id would not be enabled */
> +	queue_cfg = &port_cfg[port_id][queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
> +		return -EINVAL;
> +
> +	/* let the callback know we're shutting down */
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;

Same as above - write to pwr_mgmt_state and read from umwait_in_progress
could be reordered by cpu.
Need to insert rte_atomic_thread_fence(__ATOMIC_SEQ_CST) between them.
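
To illustrate the ordering problem, here is a minimal sketch using plain
C11 atomics rather than the DPDK wrappers. All names in it are hypothetical
stand-ins for the fields in the patch, and atomic_thread_fence() with
memory_order_seq_cst plays the role of rte_atomic_thread_fence(__ATOMIC_SEQ_CST).
Each side stores its own flag, issues a full fence, and only then loads the
other side's flag, so at least one of the two sides is guaranteed to observe
the other's store:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* hypothetical stand-ins for the two fields discussed above */
atomic_bool demo_sleep_in_progress;      /* plays umwait_in_progress */
atomic_bool demo_mgmt_enabled = true;    /* plays pwr_mgmt_state == ENABLED */

/* data plane side: announce intent to sleep, then check for cancellation */
bool
demo_sleeper_may_sleep(void)
{
	atomic_store_explicit(&demo_sleep_in_progress, true,
			memory_order_relaxed);
	/* full fence: the store above may not be reordered after the load */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&demo_mgmt_enabled, memory_order_relaxed);
}

/* data plane side: leave the "about to sleep or sleeping" window */
void
demo_sleeper_done(void)
{
	atomic_store_explicit(&demo_sleep_in_progress, false,
			memory_order_release);
}

/* control side: request cancellation, then check if a wakeup is needed */
bool
demo_canceller_must_wake(void)
{
	atomic_store_explicit(&demo_mgmt_enabled, false,
			memory_order_relaxed);
	/* same full fence, mirrored on the control side */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&demo_sleep_in_progress,
			memory_order_relaxed);
}
```

Without the two fences, both loads could return the old values, i.e. the
sleeper would go to sleep while the canceller concludes nobody is sleeping.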

BTW, out of curiosity - why do you need this intermediate
state (PMD_MGMT_BUSY) at all?
Why not directly:
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
?

> +
> +	switch (queue_cfg->cb_mode) {
> +	case RTE_POWER_MGMT_TYPE_MONITOR:
> +	{
> +		bool exit = false;
> +		do {
> +			/*
> +			 * we may request cancellation while the other thread
> +			 * has just entered the callback but hasn't started
> +			 * sleeping yet, so keep waking it up until we know it's
> +			 * done sleeping.
> +			 */
> +			if (queue_cfg->umwait_in_progress)
> +				rte_power_monitor_wakeup(lcore_id);
> +			else
> +				exit = true;
> +		} while (!exit);
> +	}
> +	/* fall-through */
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +				queue_cfg->cur_cb);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +		rte_power_freq_max(lcore_id);
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +				queue_cfg->cur_cb);
> +		rte_power_exit(lcore_id);
> +		break;
> +	}
> +	/*
> +	 * we don't free the RX callback here because it is unsafe to do so
> +	 * unless we know for a fact that all data plane threads have stopped.
> +	 */
> +	queue_cfg->cur_cb = NULL;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
> new file mode 100644
> index 0000000000..0bfbc6ba69
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.h
> @@ -0,0 +1,90 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_PMD_MGMT_H
> +#define _RTE_POWER_PMD_MGMT_H
> +
> +/**
> + * @file
> + * RTE PMD Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_power.h>
> +#include <rte_atomic.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * PMD Power Management Type
> + */
> +enum rte_power_pmd_mgmt_type {
> +	/** Use power-optimized monitoring to wait for incoming traffic */
> +	RTE_POWER_MGMT_TYPE_MONITOR = 1,
> +	/** Use power-optimized sleep to avoid busy polling */
> +	RTE_POWER_MGMT_TYPE_PAUSE,
> +	/** Use frequency scaling when traffic is low */
> +	RTE_POWER_MGMT_TYPE_SCALE,
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable power management on a specified RX queue and lcore.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @param mode
> + *   The power management callback function type.
> +
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id,
> +		enum rte_power_pmd_mgmt_type mode);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Disable power management on a specified RX queue and lcore.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id);
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
> index 69ca9af616..61996b4d11 100644
> --- a/lib/librte_power/version.map
> +++ b/lib/librte_power/version.map
> @@ -34,4 +34,9 @@ EXPERIMENTAL {
>  	rte_power_guest_channel_receive_msg;
>  	rte_power_poll_stat_fetch;
>  	rte_power_poll_stat_update;
> +
> +	# added in 21.02
> +	rte_power_pmd_mgmt_queue_enable;
> +	rte_power_pmd_mgmt_queue_disable;
> +
>  };
> --
> 2.25.1


* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-13 13:01                 ` Ananyev, Konstantin
  2021-01-13 17:22                   ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-13 13:01 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Richardson, Bruce, thomas, Hunt, David,
	Macnamara, Chris


> 
> Instead of passing around pointers and integers, collect everything
> into struct. This makes API design around these intrinsics much easier.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
> 
> Notes:
>     v16:
>     - Add error handling

There are a few trivial checkpatch warnings, please check.



* Re: [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API
  2021-01-12 20:32       ` Lance Richardson
@ 2021-01-13 13:04         ` Burakov, Anatoly
  2021-01-13 13:25           ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-13 13:04 UTC (permalink / raw)
  To: Lance Richardson
  Cc: dev, Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Ananyev, Konstantin, gage.eads,
	timothy.mcdaniel, david.hunt, Bruce Richardson, chris.macnamara

On 12-Jan-21 8:32 PM, Lance Richardson wrote:
> On Thu, Dec 17, 2020 at 9:08 AM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add a simple API to allow getting the monitor conditions for
>> power-optimized monitoring of the RX queues from the PMD, as well as
>> release notes information.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> ---
> <snip>
>>   /**
>>    * @internal A structure containing the functions exported by an Ethernet driver.
>>    */
>> @@ -917,6 +937,8 @@ struct eth_dev_ops {
>>          /**< Set up the connection between the pair of hairpin queues. */
>>          eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
>>          /**< Disconnect the hairpin queues of a pair from each other. */
>> +       eth_get_monitor_addr_t get_monitor_addr;
>> +       /**< Get next RX queue ring entry address. */
>>   };
>>
> 
> The implementation of get_monitor_addr will have much in common with
> the rx_descriptor_status API in struct rte_eth_dev, including the property
> that it will likely not make sense for it to be called concurrently with
> rx_pkt_burst on a given queue. Might it make more sense to have this
> API in struct rte_eth_dev instead of struct eth_dev_ops?
> 

I don't have an opinion on this as this code isn't really my area of 
expertise. I'm fine with wherever the community thinks this code should 
be. Any other opinions?

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-13 13:18                 ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-13 13:18 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, McDaniel, Timothy, Hunt, David,
	Richardson, Bruce, Macnamara, Chris


> 
> Add a simple API to allow getting the monitor conditions for
> power-optimized monitoring of the Rx queues from the PMD, as well as
> release notes information.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---
> 
> Notes:
>     v13:
>     - Fix typos and issues raised by Andrew
> 
>  doc/guides/rel_notes/release_21_02.rst |  5 +++++
>  lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
>  lib/librte_ethdev/version.map          |  3 +++
>  5 files changed, 83 insertions(+)
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.25.1


* Re: [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API
  2021-01-13 13:04         ` Burakov, Anatoly
@ 2021-01-13 13:25           ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-13 13:25 UTC (permalink / raw)
  To: Burakov, Anatoly, Lance Richardson
  Cc: dev, Ma, Liang J, Thomas Monjalon, Yigit, Ferruh,
	Andrew Rybchenko, Ray Kinsella, Neil Horman, gage.eads, McDaniel,
	Timothy, Hunt, David, Richardson, Bruce, Macnamara, Chris


> 
> On 12-Jan-21 8:32 PM, Lance Richardson wrote:
> > On Thu, Dec 17, 2020 at 9:08 AM Anatoly Burakov
> > <anatoly.burakov@intel.com> wrote:
> >>
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Add a simple API to allow getting the monitor conditions for
> >> power-optimized monitoring of the RX queues from the PMD, as well as
> >> release notes information.
> >>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >> ---
> > <snip>
> >>   /**
> >>    * @internal A structure containing the functions exported by an Ethernet driver.
> >>    */
> >> @@ -917,6 +937,8 @@ struct eth_dev_ops {
> >>          /**< Set up the connection between the pair of hairpin queues. */
> >>          eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
> >>          /**< Disconnect the hairpin queues of a pair from each other. */
> >> +       eth_get_monitor_addr_t get_monitor_addr;
> >> +       /**< Get next RX queue ring entry address. */
> >>   };
> >>
> >
> > The implementation of get_monitor_addr will have much in common with
> > the rx_descriptor_status API in struct rte_eth_dev, including the property
> > that it will likely not make sense for it to be called concurrently with
> > rx_pkt_burst on a given queue. Might it make more sense to have this
> > API in struct rte_eth_dev instead of struct eth_dev_ops?
> >
> 
> I don't have an opinion on this as this code isn't really my area of
> expertise. I'm fine with wherever the community thinks this code should
> be. Any other opinions?
> 

I don't think it is a good idea to push new members into rte_eth_dev.
It means either an ABI breakage or wasting one of our reserved fields.
IMO this function is not performance-critical enough to justify such an insertion.
In fact, I think we should look in a different direction:
remove the rx/tx_descriptor_status() functions from rte_eth_dev,
or, even better, make rte_eth_dev an opaque pointer.
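
A rough sketch of what the opaque-pointer direction could look like, to show
why it sidesteps the ABI problem. Everything below is hypothetical (the
demo_* names are not DPDK API): the application only sees a forward
declaration and accessor functions, so the library can append members to the
structure without changing anything the application was compiled against.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* ---- public header part: applications only ever see this ---- */
struct demo_dev; /* opaque, layout is private to the library */
typedef int (*demo_get_monitor_addr_t)(struct demo_dev *dev, uint16_t queue);

struct demo_dev *demo_dev_create(demo_get_monitor_addr_t cb);
int demo_dev_get_monitor_addr(struct demo_dev *dev, uint16_t queue);
void demo_dev_destroy(struct demo_dev *dev);
int demo_driver_op(struct demo_dev *dev, uint16_t queue);

/* ---- library-private part ---- */
struct demo_dev {
	demo_get_monitor_addr_t get_monitor_addr;
	/* new members can be appended here without an ABI break */
};

struct demo_dev *
demo_dev_create(demo_get_monitor_addr_t cb)
{
	struct demo_dev *dev = calloc(1, sizeof(*dev));

	if (dev != NULL)
		dev->get_monitor_addr = cb;
	return dev;
}

int
demo_dev_get_monitor_addr(struct demo_dev *dev, uint16_t queue)
{
	if (dev == NULL)
		return -EINVAL;
	if (dev->get_monitor_addr == NULL)
		return -ENOTSUP; /* driver does not implement the op */
	return dev->get_monitor_addr(dev, queue);
}

void
demo_dev_destroy(struct demo_dev *dev)
{
	free(dev);
}

/* toy driver op, only here to make the sketch usable */
int
demo_driver_op(struct demo_dev *dev, uint16_t queue)
{
	(void)dev;
	return queue + 100;
}
```

The cost is an extra function call per access, which is acceptable for a
slow-path function such as this one.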



* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics
  2021-01-13 13:01                 ` Ananyev, Konstantin
@ 2021-01-13 17:22                   ` Burakov, Anatoly
  2021-01-13 18:01                     ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-13 17:22 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Richardson, Bruce, thomas, Hunt, David,
	Macnamara, Chris

On 13-Jan-21 1:01 PM, Ananyev, Konstantin wrote:
> 
>>
>> Instead of passing around pointers and integers, collect everything
>> into struct. This makes API design around these intrinsics much easier.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> ---
>>
>> Notes:
>>      v16:
>>      - Add error handling
> 
> There are a few trivial checkpatch warnings, please check.
> 

To paraphrase Nick Fury, I recognize that checkpatch has produced
warnings, but given that I don't agree with what checkpatch has to say
in this case, I've elected to ignore it :)

In particular, these warnings relate to comments around struct members,
which I think I've made to look nice, and I also took care of correct
indentation so that the code looks the same with different tab
widths. So, I don't think it should be changed, unless you're suggesting
to re-layout comments on top of each member, rather than at the side
(which I think is more readable).

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback
  2021-01-13 12:58                 ` Ananyev, Konstantin
@ 2021-01-13 17:29                   ` Burakov, Anatoly
  2021-01-14 13:00                     ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-13 17:29 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, thomas,
	McDaniel, Timothy, Richardson, Bruce, Macnamara, Chris

On 13-Jan-21 12:58 PM, Ananyev, Konstantin wrote:
> 
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Tuesday, January 12, 2021 5:37 PM
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Hunt, David <david.hunt@intel.com>; Ray Kinsella <mdr@ashroe.eu>; Neil Horman
>> <nhorman@tuxdriver.com>; thomas@monjalon.net; Ananyev, Konstantin <konstantin.ananyev@intel.com>; McDaniel, Timothy
>> <timothy.mcdaniel@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [PATCH v16 07/11] power: add PMD power management API and callback
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add a simple on/off switch that will enable saving power when no
>> packets are arriving. It is based on counting the number of empty
>> polls and, when the number reaches a certain threshold, entering an
>> architecture-defined optimized power state that will wait until
>> either a TSC timestamp expires or packets arrive.
>>
>> This API mandates a core-to-single-queue mapping (that is, multiple
>> queues per device are supported, but they have to be polled on different
>> cores).
>>
>> This design is using PMD RX callbacks.
>>
>> 1. UMWAIT/UMONITOR:
>>
>>     When a certain threshold of empty polls is reached, the core will go
>>     into a power optimized sleep while waiting on an address of next RX
>>     descriptor to be written to.
>>
>> 2. TPAUSE/Pause instruction
>>
>>     This method uses the pause (or TPAUSE, if available) instruction to
>>     avoid busy polling.
>>
>> 3. Frequency scaling
>>     Reuse existing DPDK power library to scale up/down core frequency
>>     depending on traffic volume.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v15:
>>      - Fix check in UMWAIT callback
>>
>>      v13:
>>      - Rework the synchronization mechanism to not require locking
>>      - Add more parameter checking
>>      - Rework n_rx_queues access to not go through internal PMD structures and use
>>        public API instead
>>
>>   doc/guides/prog_guide/power_man.rst    |  44 +++
>>   doc/guides/rel_notes/release_21_02.rst |  10 +
>>   lib/librte_power/meson.build           |   5 +-
>>   lib/librte_power/rte_power_pmd_mgmt.c  | 359 +++++++++++++++++++++++++
>>   lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
>>   lib/librte_power/version.map           |   5 +
>>   6 files changed, 511 insertions(+), 2 deletions(-)
>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
>>
> 
> ...
> 
>> +
>> +static uint16_t
>> +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>> +		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>> +		void *addr __rte_unused)
>> +{
>> +
>> +	struct pmd_queue_cfg *q_conf;
>> +
>> +	q_conf = &port_cfg[port_id][qidx];
>> +
>> +	if (unlikely(nb_rx == 0)) {
>> +		q_conf->empty_poll_stats++;
>> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> +			struct rte_power_monitor_cond pmc;
>> +			uint16_t ret;
>> +
>> +			/*
>> +			 * we might get a cancellation request while being
>> +			 * inside the callback, in which case the wakeup
>> +			 * wouldn't work because it would've arrived too early.
>> +			 *
>> +			 * to get around this, we notify the other thread that
>> +			 * we're sleeping, so that it can spin until we're done.
>> +			 * unsolicited wakeups are perfectly safe.
>> +			 */
>> +			q_conf->umwait_in_progress = true;
> 
> This write and subsequent read can be reordered by the cpu.
> I think you need rte_atomic_thread_fence(__ATOMIC_SEQ_CST) here and
> in disable() code-path below.
> 
>> +
>> +			/* check if we need to cancel sleep */
>> +			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
>> +				/* use monitoring condition to sleep */
>> +				ret = rte_eth_get_monitor_addr(port_id, qidx,
>> +						&pmc);
>> +				if (ret == 0)
>> +					rte_power_monitor(&pmc, -1ULL);
>> +			}
>> +			q_conf->umwait_in_progress = false;
>> +		}
>> +	} else
>> +		q_conf->empty_poll_stats = 0;
>> +
>> +	return nb_rx;
>> +}
>> +
> 
> ...
> 
>> +
>> +int
>> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>> +		uint16_t port_id, uint16_t queue_id)
>> +{
>> +	struct pmd_queue_cfg *queue_cfg;
>> +
>> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>> +
>> +	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
>> +		return -EINVAL;
>> +
>> +	/* no need to check queue id as wrong queue id would not be enabled */
>> +	queue_cfg = &port_cfg[port_id][queue_id];
>> +
>> +	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>> +		return -EINVAL;
>> +
>> +	/* let the callback know we're shutting down */
>> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
> 
> Same as above - write to pwr_mgmt_state and read from umwait_in_progress
> could be reordered by cpu.
> Need to insert rte_atomic_thread_fence(__ATOMIC_SEQ_CST) between them.
> 
> BTW, out of curiosity - why do you need this intermediate
> state (PMD_MGMT_BUSY) at all?
> Why not directly:
> queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> ?
> 

Thanks for the suggestions, I'll add those.

The goal for the "intermediate" step is to prevent Rx callback from 
sleeping in the first place. We can't "wake up" earlier than it goes to 
sleep, but we may get a request to disable power management while we're 
at the beginning of the callback and haven't yet entered the 
rte_power_monitor code.

In this case, setting it to "BUSY" will prevent the callback from ever 
sleeping in the first place (see rte_power_pmd_mgmt:108 check), and will 
unset the "umwait in progress" if there was any.

So, we have three situations to handle:

1) wake up during umwait
2) "wake up" during callback after we've set the "umwait in progress" 
flag but before actual umwait happens - we don't want to exit before 
we're sure there's nothing sleeping there
3) "wake up" during callback before we set the "umwait in progress" flag

1) is handled by the rte_power_monitor_wakeup() call, so that's taken 
care of. 2) is handled by the other thread waiting on "umwait in 
progress" becoming false. 3) is handled by having this BUSY check in the 
umwait thread.

Hope that made sense!
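
For reference, the three cases can be sketched in a simplified, standalone
form. This is not the patch code: the demo_* names are made up, C11 atomics
stand in for the DPDK primitives, and the actual sleep and wakeup calls are
replaced by counters so that the flow can be followed (and compiled) in
isolation.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

enum demo_state { DEMO_DISABLED, DEMO_BUSY, DEMO_ENABLED };

struct demo_queue {
	_Atomic int state;             /* plays pwr_mgmt_state */
	atomic_bool sleep_in_progress; /* plays umwait_in_progress */
	unsigned int sleeps;           /* stand-in for rte_power_monitor() */
	unsigned int wakeups;          /* stand-in for the wakeup call */
};

void
demo_queue_init(struct demo_queue *q)
{
	atomic_init(&q->state, DEMO_ENABLED);
	atomic_init(&q->sleep_in_progress, false);
	q->sleeps = 0;
	q->wakeups = 0;
}

/* Rx callback side; returns true if it actually went to sleep */
bool
demo_callback_maybe_sleep(struct demo_queue *q)
{
	bool slept = false;

	atomic_store_explicit(&q->sleep_in_progress, true,
			memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	/* case 3: BUSY or DISABLED was set before the flag, so never sleep */
	if (atomic_load_explicit(&q->state, memory_order_relaxed) ==
			DEMO_ENABLED) {
		q->sleeps++;
		slept = true;
	}
	atomic_store_explicit(&q->sleep_in_progress, false,
			memory_order_release);
	return slept;
}

/* control side */
int
demo_disable(struct demo_queue *q)
{
	if (atomic_load(&q->state) != DEMO_ENABLED)
		return -EINVAL;
	/* let the callback know we're shutting down */
	atomic_store_explicit(&q->state, DEMO_BUSY, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	/* cases 1 and 2: keep waking until the callback has left the window */
	while (atomic_load_explicit(&q->sleep_in_progress,
			memory_order_acquire))
		q->wakeups++;
	atomic_store(&q->state, DEMO_DISABLED);
	return 0;
}
```

In a single-threaded run the wakeup loop never spins; with a real data plane
thread it spins only for as long as the callback is between setting and
clearing its flag.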

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics
  2021-01-13 17:22                   ` Burakov, Anatoly
@ 2021-01-13 18:01                     ` Ananyev, Konstantin
  2021-01-14 10:23                       ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-13 18:01 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Richardson, Bruce, thomas, Hunt, David,
	Macnamara, Chris


> On 13-Jan-21 1:01 PM, Ananyev, Konstantin wrote:
> >
> >>
> >> Instead of passing around pointers and integers, collect everything
> >> into struct. This makes API design around these intrinsics much easier.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >> ---
> >>
> >> Notes:
> >>      v16:
> >>      - Add error handling
> >
> > There are a few trivial checkpatch warnings, please check.
> >
> 
> To paraphrase Nick Fury, I recognize that checkpatch has produced
> warnings, but given that I don't agree with what checkpatch has to say
> in this case, I've elected to ignore it :)
> 
> In particular, these warnings relate to comments around struct members,
> which I think I've made to look nice, and I also took care of correct
> indentation so that the code looks the same with different tab
> widths. So, I don't think it should be changed, unless you're suggesting
> to re-layout comments on top of each member, rather than at the side
> (which I think is more readable).

If top is not an option, it is possible to move the comment onto the lines right after the actual field:
	uint32_t x;
	/**<
	 * blah, blah
	 * blah, blah, blah
	 */
AFAIK that would keep checkpatch happy.


* Re: [dpdk-dev] [PATCH v16 00/11] Add PMD power management
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (10 preceding siblings ...)
  2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-14  9:36               ` David Marchand
  2021-01-14 10:25                 ` Burakov, Anatoly
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
  12 siblings, 1 reply; 421+ messages in thread
From: David Marchand @ 2021-01-14  9:36 UTC (permalink / raw)
  To: Anatoly Burakov, Ray Kinsella
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Timothy McDaniel,
	David Hunt, Bruce Richardson, chris.macnamara, Kevin Traynor

On Tue, Jan 12, 2021 at 6:37 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers to cause the
> CPU to enter a power-optimized state while waiting for packets to
> arrive. There are multiple proposed mechanisms to achieve said power
> savings: simple frequency scaling, idle loop, and monitoring the Rx
> queue for incoming packets. The latter is achieved through cooperation
> with the NIC driver that will allow us to know the address of the wake-up
> event, and wait for writes on that address.
>
> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
> are used in their raw opcode form because there is no widespread
> compiler support for them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen to implement
> similar instructions.
>
> To achieve power savings, there is a very simple mechanism used: we're
> counting empty polls, and if a certain threshold is reached, we employ
> one of the suggested power management schemes automatically, from within
> a Rx callback inside the PMD. Once there's traffic again, the empty poll
> counter is reset.
>
> This patchset also introduces a few changes into existing power
> management-related intrinsics, namely to provide a native way of waking
> up a sleeping core without application being responsible for it, as well
> as general robustness improvements. There's quite a bit of locking going
> on, but these locks are per-thread and very little (if any) contention
> is expected, so the performance impact shouldn't be that bad (and in any
> case the locking happens when we're about to sleep anyway).
>
> Why are we putting it into ethdev as opposed to leaving this up to the
> application? Our customers specifically requested a way to do it with
> minimal changes to the application code. The current approach allows to
> just flip a switch and automatically have power savings.
>
> Things of note:
>
> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>   must at most handle RX on a single queue

If we want to save power, it is likely we would poll more rxqs on a thread.


> - Three policy types are supported: Monitor, Pause, and Frequency Scaling
> - Power management is enabled per-queue
> - The API doesn't extend to other device types
>
> v16:
> - Implemented Konstantin's suggestions and comments
> - Added return values to the API

- This revision breaks SPDK build (reported by UNH):
http://mails.dpdk.org/archives/test-report/2021-January/174069.html


- Build is broken for ARM and PPC at patch:
86491d5bd4 - (HEAD) eal: add monitor wakeup function (25 minutes ago)
<Anatoly Burakov>

Only pasting the ARM failure:
ninja: Entering directory `/home/dmarchan/builds/build-arm64-host-clang'
[1/297] Compiling C object
'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'.
FAILED: lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o
aarch64-linux-gnu-gcc -Ilib/76b5a35@@rte_eal@sta -Ilib
-I../../dpdk/lib -I. -I../../dpdk/ -Iconfig -I../../dpdk/config
-Ilib/librte_eal/include -I../../dpdk/lib/librte_eal/include
-Ilib/librte_eal/linux/include
-I../../dpdk/lib/librte_eal/linux/include -Ilib/librte_eal/arm/include
-I../../dpdk/lib/librte_eal/arm/include -Ilib/librte_eal/common
-I../../dpdk/lib/librte_eal/common -Ilib/librte_eal
-I../../dpdk/lib/librte_eal -Ilib/librte_kvargs
-I../../dpdk/lib/librte_kvargs
-Ilib/librte_telemetry/../librte_metrics
-I../../dpdk/lib/librte_telemetry/../librte_metrics
-Ilib/librte_telemetry -I../../dpdk/lib/librte_telemetry
-fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall
-Winvalid-pch -Werror -O2 -g -include rte_config.h -Wextra -Wcast-qual
-Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security
-Wmissing-declarations -Wmissing-prototypes -Wnested-externs
-Wold-style-definition -Wpointer-arith -Wsign-compare
-Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned
-Wno-missing-field-initializers -D_GNU_SOURCE -fPIC -march=armv8-a+crc
-DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API -Wno-format-truncation
'-DABI_VERSION="21.1"' -DRTE_LIBEAL_USE_GETENTROPY -MD -MQ
'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o' -MF
'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o.d'
-o 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'
-c ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c
../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:35:1: error:
conflicting types for ‘rte_power_monitor_wakeup’
 rte_power_monitor_wakeup(const unsigned int lcore_id)
 ^~~~~~~~~~~~~~~~~~~~~~~~
In file included from
../../dpdk/lib/librte_eal/arm/include/rte_power_intrinsics.h:14,
                 from ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:5:
../../dpdk/lib/librte_eal/include/generic/rte_power_intrinsics.h:79:5:
note: previous declaration of ‘rte_power_monitor_wakeup’ was here
 int rte_power_monitor_wakeup(const unsigned int lcore_id);
     ^~~~~~~~~~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.



- The ABI check is still not happy as I reported earlier.
Reproduced on v16 (GHA had a hiccup on this revision, but previous
ones had the failure too):

1 Changed variable:

  [C] 'rte_eth_dev rte_eth_devices[]' was changed at rte_ethdev_core.h:196:1:
    type of variable changed:
      array element type 'struct rte_eth_dev' changed:
        type size hasn't changed
        1 data member change:
          type of 'const eth_dev_ops* rte_eth_dev::dev_ops' changed:
            in pointed to type 'const eth_dev_ops':
              in unqualified underlying type 'struct eth_dev_ops' at
rte_ethdev_driver.h:789:1:
                type size changed from 6208 to 6272 (in bits)
                1 data member insertion:
                  'eth_get_monitor_addr_t
eth_dev_ops::get_monitor_addr', at offset 6208 (in bits) at
rte_ethdev_driver.h:940:1
                no data member changes (94 filtered);
      type size hasn't changed

Error: ABI issue reported for 'abidiff --suppr
/home/dmarchan/dpdk/devtools/../devtools/libabigail.abignore
--no-added-syms --headers-dir1
/home/dmarchan/abi/v20.11/build-gcc-static/usr/local/include
--headers-dir2 /home/dmarchan/builds/build-gcc-static/install/usr/local/include
/home/dmarchan/abi/v20.11/build-gcc-static/dump/librte_ethdev.dump
/home/dmarchan/builds/build-gcc-static/install/dump/librte_ethdev.dump'

ABIDIFF_ABI_CHANGE, this change requires a review (abidiff flagged
this as a potential issue).

One solution is to add an exception on the eth_dev_ops structure.

--- a/devtools/libabigail.abignore
+++ b/devtools/libabigail.abignore
@@ -7,3 +7,7 @@
         symbol_version = INTERNAL
 [suppress_variable]
         symbol_version = INTERNAL
+
+; Explicit ignore for driver-only ABI
+[suppress_type]
+        name = eth_dev_ops


-- 
David Marchand


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics
  2021-01-13 18:01                     ` Ananyev, Konstantin
@ 2021-01-14 10:23                       ` Burakov, Anatoly
  2021-01-14 12:33                         ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-14 10:23 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Richardson, Bruce, thomas, Hunt, David,
	Macnamara, Chris

On 13-Jan-21 6:01 PM, Ananyev, Konstantin wrote:
> 
>> On 13-Jan-21 1:01 PM, Ananyev, Konstantin wrote:
>>>
>>>>
>>>> Instead of passing around pointers and integers, collect everything
>>>> into struct. This makes API design around these intrinsics much easier.
>>>>
>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>>>> ---
>>>>
>>>> Notes:
>>>>       v16:
>>>>       - Add error handling
>>>
>>> There are few trivial checkpatch warnings, please check
>>>
>>
>> To paraphrase Nick Fury, I recognize that checkpatch has produced
>> warnings, but given that i don't agree with what checkpatch has to say
>> in this case, I've elected to ignore it :)
>>
>> In particular, these warnings related to comments around struct members,
>> which i think i've made to look nice and also took care of correct
>> indentation in terms of code looking the same way with different tab
>> widths. So, i don't think it should be changed, unless you're suggesting
>> to re-layout comments on top of each member, rather than at the side
>> (which i think is more readable).
> 
> If top is not an option, it is possible to move the comment onto the lines following the actual field:
> uint32_t x;
> /**<
>   * blah, blah
>   * blah, blah, blah
>   */
> AFAIK that would keep checkpatch happy.
> 

It's not as much "not an option" as it would look less readable to me 
than what there currently is. If we're going to keep comments not on the 
side, then on the top they go. I'd prefer to keep it as is, but if you 
feel strongly about it, i can change it.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v16 00/11] Add PMD power management
  2021-01-14  9:36               ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management David Marchand
@ 2021-01-14 10:25                 ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-14 10:25 UTC (permalink / raw)
  To: David Marchand, Ray Kinsella
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Timothy McDaniel,
	David Hunt, Bruce Richardson, chris.macnamara, Kevin Traynor

On 14-Jan-21 9:36 AM, David Marchand wrote:
> On Tue, Jan 12, 2021 at 6:37 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> This patchset proposes a simple API for Ethernet drivers to cause the
>> CPU to enter a power-optimized state while waiting for packets to
>> arrive. There are multiple proposed mechanisms to achieve said power
>> savings: simple frequency scaling, idle loop, and monitoring the Rx
>> queue for incoming packets. The latter is achieved through cooperation
>> with the NIC driver that will allow us to know address of wake up event,
>> and wait for writes on that address.
>>
>> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
>> are used in their raw opcode form because there is no widespread
>> compiler support for them yet. Still, the API is made generic enough to
>> hopefully support other architectures, if they happen to implement
>> similar instructions.
>>
>> To achieve power savings, there is a very simple mechanism used: we're
>> counting empty polls, and if a certain threshold is reached, we employ
>> one of the suggested power management schemes automatically, from within
>> a Rx callback inside the PMD. Once there's traffic again, the empty poll
>> counter is reset.
>>
>> This patchset also introduces a few changes into existing power
>> management-related intrinsics, namely to provide a native way of waking
>> up a sleeping core without application being responsible for it, as well
>> as general robustness improvements. There's quite a bit of locking going
>> on, but these locks are per-thread and very little (if any) contention
>> is expected, so the performance impact shouldn't be that bad (and in any
>> case the locking happens when we're about to sleep anyway).
>>
>> Why are we putting it into ethdev as opposed to leaving this up to the
>> application? Our customers specifically requested a way to do it with
>> minimal changes to the application code. The current approach allows to
>> just flip a switch and automatically have power savings.
>>
>> Things of note:
>>
>> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>>    must at most handle RX on a single queue
> 
> If we want to save power, it is likely we would poll more rxqs on a thread.

We are investigating possibilities to make that happen, but for this 
patchset, this is the limitation.

> 
> 
>> - Support 3 type policies. Monitor/Pause/Frequency Scaling
>> - Power management is enabled per-queue
>> - The API doesn't extend to other device types
>>
>> v16:
>> - Implemented Konstantin's suggestions and comments
>> - Added return values to the API
> 
> - This revision breaks SPDK build (reported by UNH):
> http://mails.dpdk.org/archives/test-report/2021-January/174069.html
> 
> 
> - Build is broken for ARM and PPC at patch:
> 86491d5bd4 - (HEAD) eal: add monitor wakeup function (25 minutes ago)
> <Anatoly Burakov>
> 
> Only pasting the ARM failure:
> ninja: Entering directory `/home/dmarchan/builds/build-arm64-host-clang'
> [1/297] Compiling C object
> 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'.
> FAILED: lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o
> aarch64-linux-gnu-gcc -Ilib/76b5a35@@rte_eal@sta -Ilib
> -I../../dpdk/lib -I. -I../../dpdk/ -Iconfig -I../../dpdk/config
> -Ilib/librte_eal/include -I../../dpdk/lib/librte_eal/include
> -Ilib/librte_eal/linux/include
> -I../../dpdk/lib/librte_eal/linux/include -Ilib/librte_eal/arm/include
> -I../../dpdk/lib/librte_eal/arm/include -Ilib/librte_eal/common
> -I../../dpdk/lib/librte_eal/common -Ilib/librte_eal
> -I../../dpdk/lib/librte_eal -Ilib/librte_kvargs
> -I../../dpdk/lib/librte_kvargs
> -Ilib/librte_telemetry/../librte_metrics
> -I../../dpdk/lib/librte_telemetry/../librte_metrics
> -Ilib/librte_telemetry -I../../dpdk/lib/librte_telemetry
> -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall
> -Winvalid-pch -Werror -O2 -g -include rte_config.h -Wextra -Wcast-qual
> -Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security
> -Wmissing-declarations -Wmissing-prototypes -Wnested-externs
> -Wold-style-definition -Wpointer-arith -Wsign-compare
> -Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned
> -Wno-missing-field-initializers -D_GNU_SOURCE -fPIC -march=armv8-a+crc
> -DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API -Wno-format-truncation
> '-DABI_VERSION="21.1"' -DRTE_LIBEAL_USE_GETENTROPY -MD -MQ
> 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o' -MF
> 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o.d'
> -o 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'
> -c ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c
> ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:35:1: error:
> conflicting types for ‘rte_power_monitor_wakeup’
>   rte_power_monitor_wakeup(const unsigned int lcore_id)
>   ^~~~~~~~~~~~~~~~~~~~~~~~
> In file included from
> ../../dpdk/lib/librte_eal/arm/include/rte_power_intrinsics.h:14,
>                   from ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:5:
> ../../dpdk/lib/librte_eal/include/generic/rte_power_intrinsics.h:79:5:
> note: previous declaration of ‘rte_power_monitor_wakeup’ was here
>   int rte_power_monitor_wakeup(const unsigned int lcore_id);
>       ^~~~~~~~~~~~~~~~~~~~~~~~
> ninja: build stopped: subcommand failed.

Woops, wrong return value in the .c files. Will fix!
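
For anyone following along, the failure above is the usual C mismatch
between a prototype and its definition: the generic header declares the
function as returning int, while the stub in the .c file has a different
return type. A minimal sketch of a matching stub is below. The demo_ name
and the -ENOTSUP return value are assumptions made for illustration, not
taken from the patch.

```c
#include <errno.h>

/* Prototype as a shared header would declare it (hypothetical demo_ name). */
int demo_power_monitor_wakeup(const unsigned int lcore_id);

/*
 * Stub for a platform without monitor/wait support. The return type has to
 * match the prototype above; a different return type here is what produces
 * "conflicting types". Returning -ENOTSUP is an assumption of this sketch.
 */
int
demo_power_monitor_wakeup(const unsigned int lcore_id)
{
	(void)lcore_id; /* unused on this platform */
	return -ENOTSUP;
}
```

With the types aligned, callers on unsupported platforms get an error code
they can check instead of a build failure.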

> 
> 
> 
> - The ABI check is still not happy as I reported earlier.
> Reproduced on v16 (GHA had a hiccup on this revision, but previous
> ones had the failure too):
> 
> 1 Changed variable:
> 
>    [C] 'rte_eth_dev rte_eth_devices[]' was changed at rte_ethdev_core.h:196:1:
>      type of variable changed:
>        array element type 'struct rte_eth_dev' changed:
>          type size hasn't changed
>          1 data member change:
>            type of 'const eth_dev_ops* rte_eth_dev::dev_ops' changed:
>              in pointed to type 'const eth_dev_ops':
>                in unqualified underlying type 'struct eth_dev_ops' at
> rte_ethdev_driver.h:789:1:
>                  type size changed from 6208 to 6272 (in bits)
>                  1 data member insertion:
>                    'eth_get_monitor_addr_t
> eth_dev_ops::get_monitor_addr', at offset 6208 (in bits) at
> rte_ethdev_driver.h:940:1
>                  no data member changes (94 filtered);
>        type size hasn't changed
> 
> Error: ABI issue reported for 'abidiff --suppr
> /home/dmarchan/dpdk/devtools/../devtools/libabigail.abignore
> --no-added-syms --headers-dir1
> /home/dmarchan/abi/v20.11/build-gcc-static/usr/local/include
> --headers-dir2 /home/dmarchan/builds/build-gcc-static/install/usr/local/include
> /home/dmarchan/abi/v20.11/build-gcc-static/dump/librte_ethdev.dump
> /home/dmarchan/builds/build-gcc-static/install/dump/librte_ethdev.dump'
> 
> ABIDIFF_ABI_CHANGE, this change requires a review (abidiff flagged
> this as a potential issue).
> 
> One solution is to add an exception on the eth_dev_ops structure.
> 
> --- a/devtools/libabigail.abignore
> +++ b/devtools/libabigail.abignore
> @@ -7,3 +7,7 @@
>           symbol_version = INTERNAL
>   [suppress_variable]
>           symbol_version = INTERNAL
> +
> +; Explicit ignore for driver-only ABI
> +[suppress_type]
> +        name = eth_dev_ops
> 
> 

Right, OK. I didn't realize an "exception" is something you actually do 
in code, not an ad-hoc community process type thing :) I'll add this in 
the next revision.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics
  2021-01-14 10:23                       ` Burakov, Anatoly
@ 2021-01-14 12:33                         ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-14 12:33 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Richardson, Bruce, thomas, Hunt, David,
	Macnamara, Chris



> 
> On 13-Jan-21 6:01 PM, Ananyev, Konstantin wrote:
> >
> >> On 13-Jan-21 1:01 PM, Ananyev, Konstantin wrote:
> >>>
> >>>>
> >>>> Instead of passing around pointers and integers, collect everything
> >>>> into struct. This makes API design around these intrinsics much easier.
> >>>>
> >>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >>>> ---
> >>>>
> >>>> Notes:
> >>>>       v16:
> >>>>       - Add error handling
> >>>
> >>> There are few trivial checkpatch warnings, please check
> >>>
> >>
> >> To paraphrase Nick Fury, I recognize that checkpatch has produced
> >> warnings, but given that i don't agree with what checkpatch has to say
> >> in this case, I've elected to ignore it :)
> >>
> >> In particular, these warnings related to comments around struct members,
> >> which i think i've made to look nice and also took care of correct
> >> indentation in terms of code looking the same way with different tab
> >> widths. So, i don't think it should be changed, unless you're suggesting
> >> to re-layout comments on top of each member, rather than at the side
> >> (which i think is more readable).
> >
> > If top is not an option, it is possible to move the comment onto the lines following the actual field:
> > uint32_t x;
> > /**<
> >   * blah, blah
> >   * blah, blah, blah
> >   */
> > AFAIK that would keep checkpatch happy.
> >
> 
> It's not as much "not an option" as it would look less readable to me
> than what there currently is. If we're going to keep comments not on the
> side, then on the top they go. I'd prefer to keep it as is, but if you
> feel strongly about it, i can change it.

I don't have any preferences about comments placement.
I just thought it would be good to keep checkpatch happy.
Especially if the changes in question are just cosmetic ones.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback
  2021-01-13 17:29                   ` Burakov, Anatoly
@ 2021-01-14 13:00                     ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-14 13:00 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, thomas,
	McDaniel, Timothy, Richardson, Bruce, Macnamara, Chris

On 13-Jan-21 5:29 PM, Burakov, Anatoly wrote:
> On 13-Jan-21 12:58 PM, Ananyev, Konstantin wrote:
>>
>>
>>> -----Original Message-----
>>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>>> Sent: Tuesday, January 12, 2021 5:37 PM
>>> To: dev@dpdk.org
>>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Hunt, David 
>>> <david.hunt@intel.com>; Ray Kinsella <mdr@ashroe.eu>; Neil Horman
>>> <nhorman@tuxdriver.com>; thomas@monjalon.net; Ananyev, Konstantin 
>>> <konstantin.ananyev@intel.com>; McDaniel, Timothy
>>> <timothy.mcdaniel@intel.com>; Richardson, Bruce 
>>> <bruce.richardson@intel.com>; Macnamara, Chris 
>>> <chris.macnamara@intel.com>
>>> Subject: [PATCH v16 07/11] power: add PMD power management API and 
>>> callback
>>>
>>> From: Liang Ma <liang.j.ma@intel.com>
>>>
>>> Add a simple on/off switch that will enable saving power when no
>>> packets are arriving. It is based on counting the number of empty
>>> polls and, when the number reaches a certain threshold, entering an
>>> architecture-defined optimized power state that will either wait
>>> until a TSC timestamp expires, or when packets arrive.
>>>
>>> This API mandates a core-to-single-queue mapping (that is, multiple
>>> queues per device are supported, but they have to be polled on different
>>> cores).
>>>
>>> This design is using PMD RX callbacks.
>>>
>>> 1. UMWAIT/UMONITOR:
>>>
>>>     When a certain threshold of empty polls is reached, the core will go
>>>     into a power optimized sleep while waiting on an address of next RX
>>>     descriptor to be written to.
>>>
>>> 2. TPAUSE/Pause instruction
>>>
>>>     This method uses the pause (or TPAUSE, if available) instruction to
>>>     avoid busy polling.
>>>
>>> 3. Frequency scaling
>>>     Reuse existing DPDK power library to scale up/down core frequency
>>>     depending on traffic volume.
>>>
>>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>> ---
>>>
>>> Notes:
>>>      v15:
>>>      - Fix check in UMWAIT callback
>>>
>>>      v13:
>>>      - Rework the synchronization mechanism to not require locking
>>>      - Add more parameter checking
>>>      - Rework n_rx_queues access to not go through internal PMD
>>>        structures and use public API instead
>>>
>>>   doc/guides/prog_guide/power_man.rst    |  44 +++
>>>   doc/guides/rel_notes/release_21_02.rst |  10 +
>>>   lib/librte_power/meson.build           |   5 +-
>>>   lib/librte_power/rte_power_pmd_mgmt.c  | 359 +++++++++++++++++++++++++
>>>   lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
>>>   lib/librte_power/version.map           |   5 +
>>>   6 files changed, 511 insertions(+), 2 deletions(-)
>>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
>>>
>>
>> ...
>>
>>> +
>>> +static uint16_t
>>> +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>>> +		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>>> +		void *addr __rte_unused)
>>> +{
>>> +
>>> +	struct pmd_queue_cfg *q_conf;
>>> +
>>> +	q_conf = &port_cfg[port_id][qidx];
>>> +
>>> +	if (unlikely(nb_rx == 0)) {
>>> +		q_conf->empty_poll_stats++;
>>> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>>> +			struct rte_power_monitor_cond pmc;
>>> +			uint16_t ret;
>>> +
>>> +			/*
>>> +			 * we might get a cancellation request while being
>>> +			 * inside the callback, in which case the wakeup
>>> +			 * wouldn't work because it would've arrived too early.
>>> +			 *
>>> +			 * to get around this, we notify the other thread that
>>> +			 * we're sleeping, so that it can spin until we're done.
>>> +			 * unsolicited wakeups are perfectly safe.
>>> +			 */
>>> +			q_conf->umwait_in_progress = true;
>>
>> This write and subsequent read can be reordered by the cpu.
>> I think you need rte_atomic_thread_fence(__ATOMIC_SEQ_CST) here and
>> in disable() code-path below.
>>
>>> +
>>> +			/* check if we need to cancel sleep */
>>> +			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
>>> +				/* use monitoring condition to sleep */
>>> +				ret = rte_eth_get_monitor_addr(port_id, qidx,
>>> +						&pmc);
>>> +				if (ret == 0)
>>> +					rte_power_monitor(&pmc, -1ULL);
>>> +			}
>>> +			q_conf->umwait_in_progress = false;
>>> +		}
>>> +	} else
>>> +		q_conf->empty_poll_stats = 0;
>>> +
>>> +	return nb_rx;
>>> +}
>>> +
>>
>> ...
>>
>>> +
>>> +int
>>> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>>> +		uint16_t port_id, uint16_t queue_id)
>>> +{
>>> +	struct pmd_queue_cfg *queue_cfg;
>>> +
>>> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>>> +
>>> +	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
>>> +		return -EINVAL;
>>> +
>>> +	/* no need to check queue id as wrong queue id would not be enabled */
>>> +	queue_cfg = &port_cfg[port_id][queue_id];
>>> +
>>> +	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>>> +		return -EINVAL;
>>> +
>>> +	/* let the callback know we're shutting down */
>>> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
>>
>> Same as above - write to pwr_mgmt_state and read from umwait_in_progress
>> could be reordered by cpu.
>> Need to insert rte_atomic_thread_fence(__ATOMIC_SEQ_CST) between them.
>>
>> BTW, out of curiosity - why do you need this intermediate
>> state (PMD_MGMT_BUSY) at all?
>> Why not directly:
>> queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> ?
>>
> 
> Thanks for suggestions, i'll add those.
> 
> The goal for the "intermediate" step is to prevent Rx callback from 
> sleeping in the first place. We can't "wake up" earlier than it goes to 
> sleep, but we may get a request to disable power management while we're 
> at the beginning of the callback and haven't yet entered the 
> rte_power_monitor code.
> 
> In this case, setting it to "BUSY" will prevent the callback from ever 
> sleeping in the first place (see rte_power_pmd_mgmt:108 check), and will 
> unset the "umwait in progress" if there was any.
> 
> So, we have three situations to handle:
> 
> 1) wake up during umwait
> 2) "wake up" during callback after we've set the "umwait in progress" 
> flag but before actual umwait happens - we don't want to exit before
> we're sure there's nothing sleeping there
> 3) "wake up" during callback before we set the "umwait in progress" flag
> 
> 1) is handled by the rte_power_monitor_wakeup() call, so that's taken 
> care of. 2) is handled by the other thread waiting on "umwait in 
> progress" becoming false. 3) is handled by having this BUSY check in the 
> umwait thread.
> 
> Hope that made sense!
> 

On further thought, the "BUSY" thing relies on a hidden assumption that
enable/disable power management per queue is supposed to be thread safe. 
If we let go of this assumption, we can get by with just enable/disable, 
so i think i'll just document the thread safety and leave out the "BUSY" 
part.
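
To make the ordering concern concrete, here is a rough sketch of the
two-flag handshake with the full fences Konstantin suggested, written with
plain C11 atomics rather than the DPDK wrappers. All demo_ names and the
struct layout are hypothetical, and the sketch leaves out the actual
sleep/wakeup calls; it only shows why a store followed by a load on each
side needs a sequentially consistent fence in between.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-queue state; names are illustrative, not the patch's. */
struct demo_queue_state {
	atomic_bool enabled;           /* written by the control thread */
	atomic_bool sleep_in_progress; /* written by the polling thread */
};

/*
 * Polling thread: announce the intent to sleep, then re-check that power
 * management is still enabled. Returns true if the caller may sleep.
 */
static bool
demo_prepare_sleep(struct demo_queue_state *q)
{
	atomic_store_explicit(&q->sleep_in_progress, true,
			memory_order_relaxed);
	/* full fence: keep the store above ordered before the load below */
	atomic_thread_fence(memory_order_seq_cst);
	if (!atomic_load_explicit(&q->enabled, memory_order_relaxed)) {
		atomic_store_explicit(&q->sleep_in_progress, false,
				memory_order_relaxed);
		return false;
	}
	return true;
}

/* Polling thread: called after waking up (for any reason). */
static void
demo_finish_sleep(struct demo_queue_state *q)
{
	atomic_store_explicit(&q->sleep_in_progress, false,
			memory_order_release);
}

/*
 * Control thread: clear the enable flag, then check whether the polling
 * thread may be asleep. Returns true if a wakeup should be sent.
 */
static bool
demo_disable(struct demo_queue_state *q)
{
	atomic_store_explicit(&q->enabled, false, memory_order_relaxed);
	/* full fence: keep the store above ordered before the load below */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&q->sleep_in_progress,
			memory_order_relaxed);
}
```

With both fences in place, at least one side observes the other's store:
either the polling thread sees the queue disabled and skips the sleep, or
the control thread sees the sleep flag and knows a wakeup is needed.
Without them, both loads could return the old values and the sleeper would
be missed.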

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 00/11] Add PMD power management
  2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
                                 ` (11 preceding siblings ...)
  2021-01-14  9:36               ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management David Marchand
@ 2021-01-14 14:46               ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics Anatoly Burakov
                                   ` (12 more replies)
  12 siblings, 13 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt,
	bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the  
CPU to enter a power-optimized state while waiting for packets to  
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver that will allow us to know the address of the wake-up event,
and wait for writes on that address.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They 
are used in their raw opcode form because there is no widespread 
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement 
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're 
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
a Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.
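
As a rough illustration of the empty-poll counting described above, the
bookkeeping could look something like the sketch below. The struct, the
function name and the configurable threshold are assumptions made for this
example (the patch itself compares against a fixed EMPTYPOLL_MAX constant),
and the power-saving action itself is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue bookkeeping; field names are illustrative. */
struct demo_poll_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
	uint64_t threshold;   /* assumed tunable limit for this sketch */
};

/*
 * Called after every Rx burst with the number of packets received.
 * Returns true once the number of consecutive empty polls exceeds the
 * threshold, i.e. when the caller should apply a power-saving scheme.
 */
static bool
demo_rx_poll_done(struct demo_poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic is back: reset the counter */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls > st->threshold;
}
```

In a real Rx callback the true case is where monitor, pause or frequency
scaling would be invoked, depending on the selected policy.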

This patchset also introduces a few changes into existing power 
management-related intrinsics, namely to provide a native way of waking 
up a sleeping core without the application being responsible for it, as well
as general robustness improvements. There's quite a bit of locking going 
on, but these locks are per-thread and very little (if any) contention 
is expected, so the performance impact shouldn't be that bad (and in any 
case the locking happens when we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the 
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows one to
just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle RX on at most a single queue
- Supports three policy types: Monitor, Pause, and Frequency Scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v17:
- Added exception for ethdev driver-only ABI
- Added memory barriers for monitor/wakeup (Konstantin)
- Fixed compile issues on non-x86 platforms (hopefully!)

v16:
- Implemented Konstantin's suggestions and comments
- Added return values to the API

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 devtools/libabigail.abignore                  |   3 +
 doc/guides/prog_guide/power_man.rst           |  44 +++
 doc/guides/rel_notes/release_21_02.rst        |  15 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |  10 +-
 drivers/event/dlb2/dlb2.c                     |  10 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  25 ++
 drivers/net/i40e/i40e_rxtx.h                  |   1 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   1 +
 examples/l3fwd-power/main.c                   |  89 ++++-
 .../arm/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  40 ++
 .../include/generic/rte_power_intrinsics.h    |  88 ++---
 .../ppc/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  40 ++
 lib/librte_eal/version.map                    |   3 +
 .../x86/include/rte_power_intrinsics.h        | 115 ------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 215 +++++++++++
 lib/librte_ethdev/rte_ethdev.c                |  28 ++
 lib/librte_ethdev/rte_ethdev.h                |  25 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  22 ++
 lib/librte_ethdev/version.map                 |   3 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 364 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  90 +++++
 lib/librte_power/version.map                  |   5 +
 35 files changed, 1155 insertions(+), 259 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 02/11] eal: avoid invalid API usage in " Anatoly Burakov
                                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, power intrinsics are inline functions. Make them part of the
ABI so that we can have various internal data associated with them
without exposing said data to the outside world.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v14:
    - Fix compile issues on ARM and PPC64 by moving implementations to .c files

 .../arm/include/rte_power_intrinsics.h        |  40 ------
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  45 +++++++
 .../include/generic/rte_power_intrinsics.h    |   6 +-
 .../ppc/include/rte_power_intrinsics.h        |  40 ------
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  45 +++++++
 lib/librte_eal/version.map                    |   3 +
 .../x86/include/rte_power_intrinsics.h        | 115 -----------------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 120 ++++++++++++++++++
 11 files changed, 219 insertions(+), 198 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index a4a1bc1159..9e498e9ebf 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -13,46 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on ARM.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	RTE_SET_USED(tsc_timestamp);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build
index d62875ebae..6ec53ea03a 100644
--- a/lib/librte_eal/arm/meson.build
+++ b/lib/librte_eal/arm/meson.build
@@ -7,4 +7,5 @@ sources += files(
 	'rte_cpuflags.c',
 	'rte_cycles.c',
 	'rte_hypervisor.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
new file mode 100644
index 0000000000..ab1f44f611
--- /dev/null
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index dd520d90fa..67977bd511 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -52,7 +52,7 @@
  *   to undefined result.
  */
 __rte_experimental
-static inline void rte_power_monitor(const volatile void *p,
+void rte_power_monitor(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz);
 
@@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p,
  *   wakes up.
  */
 __rte_experimental
-static inline void rte_power_monitor_sync(const volatile void *p,
+void rte_power_monitor_sync(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz,
 		rte_spinlock_t *lck);
@@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p,
  *   architecture-dependent.
  */
 __rte_experimental
-static inline void rte_power_pause(const uint64_t tsc_timestamp);
+void rte_power_pause(const uint64_t tsc_timestamp);
 
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 4ed03d521f..c0e9ac279f 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -13,46 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
-}
-
-/**
- * This function is not supported on PPC64.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	RTE_SET_USED(tsc_timestamp);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/ppc/meson.build b/lib/librte_eal/ppc/meson.build
index f4b6d95c42..43c46542fb 100644
--- a/lib/librte_eal/ppc/meson.build
+++ b/lib/librte_eal/ppc/meson.build
@@ -7,4 +7,5 @@ sources += files(
 	'rte_cpuflags.c',
 	'rte_cycles.c',
 	'rte_hypervisor.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
new file mode 100644
index 0000000000..84340ca2a4
--- /dev/null
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+	RTE_SET_USED(lck);
+	RTE_SET_USED(data_sz);
+}
+
+/**
+ * This function is not supported on PPC64.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index b1db7ec795..32eceb8869 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -405,6 +405,9 @@ EXPERIMENTAL {
 	rte_vect_set_max_simd_bitwidth;
 
 	# added in 21.02
+	rte_power_monitor;
+	rte_power_monitor_sync;
+	rte_power_pause;
 	rte_thread_tls_key_create;
 	rte_thread_tls_key_delete;
 	rte_thread_tls_value_get;
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
index c7d790c854..e4c2b87f73 100644
--- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -13,121 +13,6 @@ extern "C" {
 
 #include "generic/rte_power_intrinsics.h"
 
-static inline uint64_t
-__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz)
-{
-	switch (sz) {
-	case sizeof(uint8_t):
-		return *(const volatile uint8_t *)p;
-	case sizeof(uint16_t):
-		return *(const volatile uint16_t *)p;
-	case sizeof(uint32_t):
-		return *(const volatile uint32_t *)p;
-	case sizeof(uint64_t):
-		return *(const volatile uint64_t *)p;
-	default:
-		/* this is an intrinsic, so we can't have any error handling */
-		RTE_ASSERT(0);
-		return 0;
-	}
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-}
-
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(p));
-
-	if (value_mask) {
-		const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
-			return;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-}
-
-/**
- * This function uses TPAUSE instruction  and will enter C0.2 state. For more
- * information about usage of this instruction, please refer to Intel(R) 64 and
- * IA-32 Architectures Software Developer's Manual.
- */
-static inline void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* execute TPAUSE */
-	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
-		: /* ignore rflags */
-		: "D"(0), /* enter C0.2 */
-		  "a"(tsc_l), "d"(tsc_h));
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build
index e78f29002e..dfd42dee0c 100644
--- a/lib/librte_eal/x86/meson.build
+++ b/lib/librte_eal/x86/meson.build
@@ -8,4 +8,5 @@ sources += files(
 	'rte_cycles.c',
 	'rte_hypervisor.c',
 	'rte_spinlock.c',
+	'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
new file mode 100644
index 0000000000..34c5fd9c3e
--- /dev/null
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "rte_power_intrinsics.h"
+
+static inline uint64_t
+__get_umwait_val(const volatile void *p, const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):
+		return *(const volatile uint8_t *)p;
+	case sizeof(uint16_t):
+		return *(const volatile uint16_t *)p;
+	case sizeof(uint32_t):
+		return *(const volatile uint32_t *)p;
+	case sizeof(uint64_t):
+		return *(const volatile uint64_t *)p;
+	default:
+		/* this is an intrinsic, so we can't have any error handling */
+		RTE_ASSERT(0);
+		return 0;
+	}
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
+		const uint64_t value_mask, const uint64_t tsc_timestamp,
+		const uint8_t data_sz, rte_spinlock_t *lck)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = __get_umwait_val(p, data_sz);
+		const uint64_t masked = cur_value & value_mask;
+
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	rte_spinlock_unlock(lck);
+
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			  "a"(tsc_l), "d"(tsc_h));
+
+	rte_spinlock_lock(lck);
+}
+
+/**
+ * This function uses TPAUSE instruction  and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+			: /* ignore rflags */
+			: "D"(0), /* enter C0.2 */
+			"a"(tsc_l), "d"(tsc_h));
+}
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 02/11] eal: avoid invalid API usage in power intrinsics
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 03/11] eal: change API of " Anatoly Burakov
                                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel,
	david.hunt, chris.macnamara

Currently, the API documentation mandates that if the user wants to use
the power management intrinsics, they need to call the
`rte_cpu_get_intrinsics_support` API and check support for specific
intrinsics.

However, if the user does not do that, it is possible to get an illegal
instruction error because we're using raw instruction opcodes, which may
or may not be supported at runtime.

Now that we have everything in a C file, we can check for support at
startup and prevent the user from possibly encountering illegal
instruction errors.

We also add return values to the APIs, because why not.
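
As a rough illustration of the pattern described above (probe once at
startup, then refuse to run when unsupported), here is a minimal
standalone sketch. It is not the code from this patch: every name ending
in `_sketch` is hypothetical, the probe is hard-coded to report no
support, and a plain GCC/Clang constructor attribute stands in for
RTE_INIT.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the data filled in by
 * rte_cpu_get_intrinsics_support(). */
struct cpu_intrinsics_sketch {
	bool power_monitor;
	bool power_pause;
};

/* Hypothetical probe: a real one would query the CPU; this one simply
 * reports that nothing is supported. */
static void
get_intrinsics_support_sketch(struct cpu_intrinsics_sketch *i)
{
	i->power_monitor = false;
	i->power_pause = false;
}

/* Internal state, invisible to callers because it lives in the .c file. */
static bool wait_supported;

/* Runs before main(), standing in for an RTE_INIT constructor. */
__attribute__((constructor)) static void
power_intrinsics_init_sketch(void)
{
	struct cpu_intrinsics_sketch i;

	get_intrinsics_support_sketch(&i);
	if (i.power_monitor && i.power_pause)
		wait_supported = true;
}

/* Accept only the operand sizes the API documents: 1, 2, 4 or 8 bytes. */
static int
check_val_size_sketch(const uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
	case sizeof(uint16_t):
	case sizeof(uint32_t):
	case sizeof(uint64_t):
		return 0;
	default:
		return -1;
	}
}

/* Check support first, then parameters, and only then do the work. */
int
power_monitor_sketch(const volatile void *p, const uint8_t data_sz)
{
	if (!wait_supported)
		return -ENOTSUP;
	if (check_val_size_sketch(data_sz) < 0)
		return -EINVAL;

	/* a real implementation would arm the monitor and wait here */
	(void)p;
	return 0;
}
```

With this ordering, a caller that skips the support check gets -ENOTSUP
back instead of an illegal instruction fault.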

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v16:
    - Add return values and proper error handling to the API
    
    v15:
    - Remove accidental whitespace changes
    
    v14:
    - Replace uint8_t with bool
    

 lib/librte_eal/arm/rte_power_intrinsics.c     | 12 +++-
 .../include/generic/rte_power_intrinsics.h    | 24 +++++--
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 12 +++-
 lib/librte_eal/x86/rte_power_intrinsics.c     | 64 +++++++++++++++++--
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index ab1f44f611..7e7552fa8a 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -7,7 +7,7 @@
 /**
  * This function is not supported on ARM.
  */
-void
+int
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
@@ -17,12 +17,14 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	RTE_SET_USED(value_mask);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(data_sz);
+
+	return -ENOTSUP;
 }
 
 /**
  * This function is not supported on ARM.
  */
-void
+int
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
@@ -33,13 +35,17 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
 	RTE_SET_USED(data_sz);
+
+	return -ENOTSUP;
 }
 
 /**
  * This function is not supported on ARM.
  */
-void
+int
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
 }
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 67977bd511..37e4ec0414 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -34,7 +34,6 @@
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -50,9 +49,14 @@
  *   Data size (in bytes) that will be used to compare expected value with the
  *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
  *   to undefined result.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
  */
 __rte_experimental
-void rte_power_monitor(const volatile void *p,
+int rte_power_monitor(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz);
 
@@ -75,7 +79,6 @@ void rte_power_monitor(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param p
  *   Address to monitor for changes.
@@ -95,9 +98,14 @@ void rte_power_monitor(const volatile void *p,
  *   A spinlock that must be locked before entering the function, will be
  *   unlocked while the CPU is sleeping, and will be locked again once the CPU
  *   wakes up.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
  */
 __rte_experimental
-void rte_power_monitor_sync(const volatile void *p,
+int rte_power_monitor_sync(const volatile void *p,
 		const uint64_t expected_value, const uint64_t value_mask,
 		const uint64_t tsc_timestamp, const uint8_t data_sz,
 		rte_spinlock_t *lck);
@@ -111,13 +119,17 @@ void rte_power_monitor_sync(const volatile void *p,
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *   Failing to do so may result in an illegal CPU instruction error.
  *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
  */
 __rte_experimental
-void rte_power_pause(const uint64_t tsc_timestamp);
+int rte_power_pause(const uint64_t tsc_timestamp);
 
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 84340ca2a4..929e0611b0 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -7,7 +7,7 @@
 /**
  * This function is not supported on PPC64.
  */
-void
+int
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
@@ -17,12 +17,14 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	RTE_SET_USED(value_mask);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(data_sz);
+
+	return -ENOTSUP;
 }
 
 /**
  * This function is not supported on PPC64.
  */
-void
+int
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
@@ -33,13 +35,17 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
 	RTE_SET_USED(data_sz);
+
+	return -ENOTSUP;
 }
 
 /**
  * This function is not supported on PPC64.
  */
-void
+int
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
 }
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 34c5fd9c3e..2a38440bec 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,8 @@
 
 #include "rte_power_intrinsics.h"
 
+static bool wait_supported;
+
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
 {
@@ -17,24 +19,47 @@ __get_umwait_val(const volatile void *p, const uint8_t sz)
 	case sizeof(uint64_t):
 		return *(const volatile uint64_t *)p;
 	default:
-		/* this is an intrinsic, so we can't have any error handling */
+		/* shouldn't happen */
 		RTE_ASSERT(0);
 		return 0;
 	}
 }
 
+static inline int
+__check_val_size(const uint8_t sz)
+{
+	switch (sz) {
+	case sizeof(uint8_t):  /* fall-through */
+	case sizeof(uint16_t): /* fall-through */
+	case sizeof(uint32_t): /* fall-through */
+	case sizeof(uint64_t): /* fall-through */
+		return 0;
+	default:
+		/* unexpected size */
+		return -1;
+	}
+}
+
 /**
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
  * For more information about usage of these instructions, please refer to
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
-void
+int
 rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
+	if (__check_val_size(data_sz) < 0)
+		return -EINVAL;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -51,13 +76,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 
 		/* if the masked value is already matching, abort */
 		if (masked == expected_value)
-			return;
+			return 0;
 	}
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
+
+	return 0;
 }
 
 /**
@@ -65,13 +92,21 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * For more information about usage of these instructions, please refer to
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
-void
+int
 rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 		const uint64_t value_mask, const uint64_t tsc_timestamp,
 		const uint8_t data_sz, rte_spinlock_t *lck)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
+	if (__check_val_size(data_sz) < 0)
+		return -EINVAL;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -88,7 +123,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 
 		/* if the masked value is already matching, abort */
 		if (masked == expected_value)
-			return;
+			return 0;
 	}
 	rte_spinlock_unlock(lck);
 
@@ -99,6 +134,8 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 			  "a"(tsc_l), "d"(tsc_h));
 
 	rte_spinlock_lock(lck);
+
+	return 0;
 }
 
 /**
@@ -106,15 +143,30 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
  * information about usage of this instruction, please refer to Intel(R) 64 and
  * IA-32 Architectures Software Developer's Manual.
  */
-void
+int
 rte_power_pause(const uint64_t tsc_timestamp)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
 	/* execute TPAUSE */
 	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			"a"(tsc_l), "d"(tsc_h));
+
+	return 0;
+}
+
+RTE_INIT(rte_power_intrinsics_init) {
+	struct rte_cpu_intrinsics i;
+
+	rte_cpu_get_intrinsics_support(&i);
+
+	if (i.power_monitor && i.power_pause)
+		wait_supported = 1;
 }
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 02/11] eal: avoid invalid API usage in " Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-18 22:26                   ` Thomas Monjalon
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 04/11] eal: remove sync version of power monitor Anatoly Burakov
                                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev, thomas,
	david.hunt, chris.macnamara

Instead of passing around pointers and integers, collect everything
into a struct. This makes API design around these intrinsics much easier.
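
To make the idea concrete, here is a small standalone sketch of a
condition struct and the "is the masked value already matching?" check
that the monitor functions perform before sleeping. The struct mirrors
the fields introduced by this patch, but the type and function names
(`monitor_cond_sketch`, `read_val_sketch`, `monitor_cond_already_met`)
are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fields grouped into the new struct. */
struct monitor_cond_sketch {
	volatile void *addr; /* address to monitor for changes */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* mask applied to the value read from addr */
	uint8_t data_sz;     /* size of the read in bytes: 1, 2, 4 or 8 */
};

/* Read data_sz bytes from p, widened to 64 bits; 0 for invalid sizes. */
static uint64_t
read_val_sketch(const volatile void *p, const uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		return 0;
	}
}

/* True if the masked value already matches, i.e. there is no point in
 * going to sleep. A zero mask disables the check entirely. */
bool
monitor_cond_already_met(const struct monitor_cond_sketch *c)
{
	assert(c != NULL);

	if (c->mask == 0)
		return false;
	return (read_val_sketch(c->addr, c->data_sz) & c->mask) == c->val;
}
```

Because the condition travels as a single object, a driver can fill it
in once and hand it to whichever layer decides when to sleep, without
that layer needing to know what the individual fields mean.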

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v16:
    - Add error handling

 drivers/event/dlb/dlb.c                       | 10 ++--
 drivers/event/dlb2/dlb2.c                     | 10 ++--
 lib/librte_eal/arm/rte_power_intrinsics.c     | 20 +++-----
 .../include/generic/rte_power_intrinsics.h    | 50 ++++++++-----------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 20 +++-----
 lib/librte_eal/x86/rte_power_intrinsics.c     | 42 +++++++++-------
 6 files changed, 70 insertions(+), 82 deletions(-)

diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c
index 0c95c4793d..d2f2026291 100644
--- a/drivers/event/dlb/dlb.c
+++ b/drivers/event/dlb/dlb.c
@@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		/* Interrupts not supported by PF PMD */
 		return 1;
 	} else if (dlb->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 86724863f2..c9a8a02278 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 	if (elapsed_ticks >= timeout) {
 		return 1;
 	} else if (dlb2->umwait_allowed) {
+		struct rte_power_monitor_cond pmc;
 		volatile struct dlb2_dequeue_qe *cq_base;
 		union {
 			uint64_t raw_qe[2];
@@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		else
 			expected_value = 0;
 
-		rte_power_monitor(monitor_addr, expected_value,
-				  qe_mask.raw_qe[1], timeout + start_ticks,
-				  sizeof(uint64_t));
+		pmc.addr = monitor_addr;
+		pmc.val = expected_value;
+		pmc.mask = qe_mask.raw_qe[1];
+		pmc.data_sz = sizeof(uint64_t);
+
+		rte_power_monitor(&pmc, timeout + start_ticks);
 
 		DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1);
 	} else {
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 7e7552fa8a..5f1caaf25b 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -8,15 +8,11 @@
  * This function is not supported on ARM.
  */
 int
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 
 	return -ENOTSUP;
 }
@@ -25,16 +21,12 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * This function is not supported on ARM.
  */
 int
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 
 	return -ENOTSUP;
 }
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 37e4ec0414..3ad53068d5 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -18,6 +18,18 @@
  * which are architecture-dependent.
  */
 
+struct rte_power_monitor_cond {
+	volatile void *addr;  /**< Address to monitor for changes */
+	uint64_t val;         /**< Before attempting the monitoring, the address
+	                       *   may be read and compared against this value.
+	                       **/
+	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
+	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
+	                  *   expected value with the memory address. Can be 1,
+	                  *   2, 4, or 8. Supplying any other value will lead to
+	                  *   undefined result. */
+};
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
@@ -35,20 +47,11 @@
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  *
  * @return
  *   0 on success
@@ -56,10 +59,8 @@
  *   -ENOTSUP if unsupported
  */
 __rte_experimental
-int rte_power_monitor(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz);
-
+int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp);
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
@@ -80,20 +81,11 @@ int rte_power_monitor(const volatile void *p,
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
  *
- * @param p
- *   Address to monitor for changes.
- * @param expected_value
- *   Before attempting the monitoring, the `p` address may be read and compared
- *   against this value. If `value_mask` is zero, this step will be skipped.
- * @param value_mask
- *   The 64-bit mask to use to extract current value from `p`.
+ * @param pmc
+ *   The monitoring condition structure.
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
- * @param data_sz
- *   Data size (in bytes) that will be used to compare expected value with the
- *   memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead
- *   to undefined result.
  * @param lck
  *   A spinlock that must be locked before entering the function, will be
  *   unlocked while the CPU is sleeping, and will be locked again once the CPU
@@ -105,10 +97,8 @@ int rte_power_monitor(const volatile void *p,
  *   -ENOTSUP if unsupported
  */
 __rte_experimental
-int rte_power_monitor_sync(const volatile void *p,
-		const uint64_t expected_value, const uint64_t value_mask,
-		const uint64_t tsc_timestamp, const uint8_t data_sz,
-		rte_spinlock_t *lck);
+int rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
 
 /**
  * @warning
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 929e0611b0..5e5a1fff5a 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -8,15 +8,11 @@
  * This function is not supported on PPC64.
  */
 int
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(data_sz);
 
 	return -ENOTSUP;
 }
@@ -25,16 +21,12 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * This function is not supported on PPC64.
  */
 int
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
-	RTE_SET_USED(p);
-	RTE_SET_USED(expected_value);
-	RTE_SET_USED(value_mask);
+	RTE_SET_USED(pmc);
 	RTE_SET_USED(tsc_timestamp);
 	RTE_SET_USED(lck);
-	RTE_SET_USED(data_sz);
 
 	return -ENOTSUP;
 }
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 2a38440bec..6be5c8b9f1 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -46,9 +46,8 @@ __check_val_size(const uint8_t sz)
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 int
-rte_power_monitor(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz)
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -57,7 +56,10 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	if (!wait_supported)
 		return -ENOTSUP;
 
-	if (__check_val_size(data_sz) < 0)
+	if (pmc == NULL)
+		return -EINVAL;
+
+	if (__check_val_size(pmc->data_sz) < 0)
 		return -EINVAL;
 
 	/*
@@ -68,14 +70,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return 0;
 	}
 	/* execute UMWAIT */
@@ -93,9 +96,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value,
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  */
 int
-rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
-		const uint64_t value_mask, const uint64_t tsc_timestamp,
-		const uint8_t data_sz, rte_spinlock_t *lck)
+rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
+		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
@@ -104,7 +106,10 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	if (!wait_supported)
 		return -ENOTSUP;
 
-	if (__check_val_size(data_sz) < 0)
+	if (pmc == NULL || lck == NULL)
+		return -EINVAL;
+
+	if (__check_val_size(pmc->data_sz) < 0)
 		return -EINVAL;
 
 	/*
@@ -115,14 +120,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value,
 	/* set address for UMONITOR */
 	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
 			:
-			: "D"(p));
+			: "D"(pmc->addr));
 
-	if (value_mask) {
-		const uint64_t cur_value = __get_umwait_val(p, data_sz);
-		const uint64_t masked = cur_value & value_mask;
+	if (pmc->mask) {
+		const uint64_t cur_value = __get_umwait_val(
+				pmc->addr, pmc->data_sz);
+		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == expected_value)
+		if (masked == pmc->val)
 			return 0;
 	}
 	rte_spinlock_unlock(lck);
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 04/11] eal: remove sync version of power monitor
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (2 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 05/11] eal: add monitor wakeup function Anatoly Burakov
                                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the "sync" version of power monitor intrinsic is supposed to
be used for purposes of waking up a sleeping core. However, there are
better ways to achieve the same result, so remove the unneeded function.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_eal/arm/rte_power_intrinsics.c     | 14 -----
 .../include/generic/rte_power_intrinsics.h    | 38 -------------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 14 -----
 lib/librte_eal/version.map                    |  1 -
 lib/librte_eal/x86/rte_power_intrinsics.c     | 54 -------------------
 5 files changed, 121 deletions(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 5f1caaf25b..8d271dc0c1 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -17,20 +17,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	return -ENOTSUP;
 }
 
-/**
- * This function is not supported on ARM.
- */
-int
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-
-	return -ENOTSUP;
-}
-
 /**
  * This function is not supported on ARM.
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 3ad53068d5..85343bc9eb 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -61,44 +61,6 @@ struct rte_power_monitor_cond {
 __rte_experimental
 int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
-/**
- * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
- *
- * Monitor specific address for changes. This will cause the CPU to enter an
- * architecture-defined optimized power state until either the specified
- * memory address is written to, a certain TSC timestamp is reached, or other
- * reasons cause the CPU to wake up.
- *
- * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
- * mask is non-zero, the current value pointed to by the `p` pointer will be
- * checked against the expected value, and if they match, the entering of
- * optimized power state may be aborted.
- *
- * This call will also lock a spinlock on entering sleep, and release it on
- * waking up the CPU.
- *
- * @warning It is responsibility of the user to check if this function is
- *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
- *
- * @param pmc
- *   The monitoring condition structure.
- * @param tsc_timestamp
- *   Maximum TSC timestamp to wait for. Note that the wait behavior is
- *   architecture-dependent.
- * @param lck
- *   A spinlock that must be locked before entering the function, will be
- *   unlocked while the CPU is sleeping, and will be locked again once the CPU
- *   wakes up.
- *
- * @return
- *   0 on success
- *   -EINVAL on invalid parameters
- *   -ENOTSUP if unsupported
- */
-__rte_experimental
-int rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck);
 
 /**
  * @warning
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index 5e5a1fff5a..f7862ea324 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -17,20 +17,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	return -ENOTSUP;
 }
 
-/**
- * This function is not supported on PPC64.
- */
-int
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	RTE_SET_USED(pmc);
-	RTE_SET_USED(tsc_timestamp);
-	RTE_SET_USED(lck);
-
-	return -ENOTSUP;
-}
-
 /**
  * This function is not supported on PPC64.
  */
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 32eceb8869..1fcd1d3bed 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,7 +406,6 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
-	rte_power_monitor_sync;
 	rte_power_pause;
 	rte_thread_tls_key_create;
 	rte_thread_tls_key_delete;
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 6be5c8b9f1..29247d8638 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -90,60 +90,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	return 0;
 }
 
-/**
- * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
- * For more information about usage of these instructions, please refer to
- * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
- */
-int
-rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc,
-		const uint64_t tsc_timestamp, rte_spinlock_t *lck)
-{
-	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
-	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
-
-	/* prevent user from running this instruction if it's not supported */
-	if (!wait_supported)
-		return -ENOTSUP;
-
-	if (pmc == NULL || lck == NULL)
-		return -EINVAL;
-
-	if (__check_val_size(pmc->data_sz) < 0)
-		return -EINVAL;
-
-	/*
-	 * we're using raw byte codes for now as only the newest compiler
-	 * versions support this instruction natively.
-	 */
-
-	/* set address for UMONITOR */
-	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
-			:
-			: "D"(pmc->addr));
-
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->data_sz);
-		const uint64_t masked = cur_value & pmc->mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			return 0;
-	}
-	rte_spinlock_unlock(lck);
-
-	/* execute UMWAIT */
-	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-			: /* ignore rflags */
-			: "D"(0), /* enter C0.2 */
-			  "a"(tsc_l), "d"(tsc_h));
-
-	rte_spinlock_lock(lck);
-
-	return 0;
-}
-
 /**
  * This function uses TPAUSE instruction  and will enter C0.2 state. For more
  * information about usage of this instruction, please refer to Intel(R) 64 and
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 05/11] eal: add monitor wakeup function
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (3 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 04/11] eal: remove sync version of power monitor Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 06/11] ethdev: add simple power management API Anatoly Burakov
                                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Now that we have everything in a C file, we can store the information
about our sleep, and have a native mechanism to wake up the sleeping
core. This mechanism would however only wake up a core that's sleeping
while monitoring - waking up from `rte_power_pause` won't work.
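
For illustration only, the "write without changing the value" idea behind
the wakeup can be sketched outside of DPDK as below. This is a standalone
model, not the EAL code: `monitored_word` and `wakeup_write` are
hypothetical names, and the assumption is that a compare-and-swap of a
value with itself is enough to issue a store to the monitored location
(as it is expected to be on x86).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the monitored location (e.g. a descriptor word). */
static uint64_t monitored_word = 0x1234;

/*
 * Issue a store to *addr without changing its contents: load the current
 * value, then compare-and-swap it with itself. This mirrors the shape of
 * __umwait_wakeup() in the patch, using the same GCC/Clang atomic builtins.
 */
static void
wakeup_write(volatile uint64_t *addr)
{
	uint64_t val;

	assert(addr != NULL);

	val = __atomic_load_n(addr, __ATOMIC_RELAXED);
	__atomic_compare_exchange_n(addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}
```

Calling `wakeup_write(&monitored_word)` leaves the stored value untouched,
which is why the waker does not need to know what the NIC is going to
write there.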

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v17:
    - Improve code readability with a goto (the horror!)
    - Fix the compile issues for non-x86 archs
    
    v16:
    - Improve error handling
    - Take a lock before UMONITOR
    
    v13:
    - Add comments around wakeup code to explain what it does
    - Add lcore_id parameter checking to prevent buffer overrun
    
    v15:
    - Fix check in UMWAIT callback
    
    v13:
    - Rework the synchronization mechanism to not require locking
    - Add more parameter checking
    - Rework n_rx_queues access to not go through internal PMD structures and use
      public API instead
    

 lib/librte_eal/arm/rte_power_intrinsics.c     | 11 +++
 .../include/generic/rte_power_intrinsics.h    | 16 ++++
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 11 +++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 93 ++++++++++++++++++-
 5 files changed, 131 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 8d271dc0c1..e83f04072a 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -27,3 +27,14 @@ rte_power_pause(const uint64_t tsc_timestamp)
 
 	return -ENOTSUP;
 }
+
+/**
+ * This function is not supported on ARM.
+ */
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+
+	return -ENOTSUP;
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 85343bc9eb..6109d28faa 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -62,6 +62,22 @@ __rte_experimental
 int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Wake up a specific lcore that is in a power optimized state and is monitoring
+ * an address.
+ *
+ * @note This function will *not* wake up a core that is in a power optimized
+ *   state due to calling `rte_power_pause`.
+ *
+ * @param lcore_id
+ *   Lcore ID of a sleeping thread.
+ */
+__rte_experimental
+int rte_power_monitor_wakeup(const unsigned int lcore_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index f7862ea324..7fc9586da7 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -27,3 +27,14 @@ rte_power_pause(const uint64_t tsc_timestamp)
 
 	return -ENOTSUP;
 }
+
+/**
+ * This function is not supported on PPC64.
+ */
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+
+	return -ENOTSUP;
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 1fcd1d3bed..fce90a112f 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,6 +406,7 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
+	rte_power_monitor_wakeup;
 	rte_power_pause;
 	rte_thread_tls_key_create;
 	rte_thread_tls_key_delete;
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 29247d8638..af3ae3237c 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -2,8 +2,31 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
+#include <rte_common.h>
+#include <rte_lcore.h>
+#include <rte_spinlock.h>
+
 #include "rte_power_intrinsics.h"
 
+/*
+ * Per-lcore structure holding current status of C0.2 sleeps.
+ */
+static struct power_wait_status {
+	rte_spinlock_t lock;
+	volatile void *monitor_addr; /**< NULL if not currently sleeping */
+} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+
+static inline void
+__umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	/* trigger a write but don't change the value */
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
 static bool wait_supported;
 
 static inline uint64_t
@@ -51,17 +74,29 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
 		return -ENOTSUP;
 
+	/* prevent non-EAL thread from using this API */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
 	if (pmc == NULL)
 		return -EINVAL;
 
 	if (__check_val_size(pmc->data_sz) < 0)
 		return -EINVAL;
 
+	s = &wait_status[lcore_id];
+
+	/* update sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = pmc->addr;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -72,6 +107,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 			:
 			: "D"(pmc->addr));
 
+	/* now that we've put this address into monitor, we can unlock */
+	rte_spinlock_unlock(&s->lock);
+
+	/* if we have a comparison mask, we might not need to sleep at all */
 	if (pmc->mask) {
 		const uint64_t cur_value = __get_umwait_val(
 				pmc->addr, pmc->data_sz);
@@ -79,14 +118,21 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 
 		/* if the masked value is already matching, abort */
 		if (masked == pmc->val)
-			return 0;
+			goto end;
 	}
+
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
 
+end:
+	/* erase sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = NULL;
+	rte_spinlock_unlock(&s->lock);
+
 	return 0;
 }
 
@@ -122,3 +168,48 @@ RTE_INIT(rte_power_intrinsics_init) {
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
 }
+
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	struct power_wait_status *s;
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
+	/* prevent buffer overrun */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	s = &wait_status[lcore_id];
+
+	/*
+	 * There is a race condition between sleep, wakeup and locking, but we
+	 * don't need to handle it.
+	 *
+	 * Possible situations:
+	 *
+	 * 1. T1 locks, sets address, unlocks
+	 * 2. T2 locks, triggers wakeup, unlocks
+	 * 3. T1 sleeps
+	 *
+	 * In this case, because T1 has already set the address for monitoring,
+	 * we will wake up immediately even if T2 triggers wakeup before T1
+	 * goes to sleep.
+	 *
+	 * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
+	 * 2. T2 locks, triggers wakeup, and unlocks
+	 * 3. T1 locks, erases address, and unlocks
+	 *
+	 * In this case, since we've already woken up, the "wakeup" was
+	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
+	 * wakeup address is still valid so it's perfectly safe to write it.
+	 */
+	rte_spinlock_lock(&s->lock);
+	if (s->monitor_addr != NULL)
+		__umwait_wakeup(s->monitor_addr);
+	rte_spinlock_unlock(&s->lock);
+
+	return 0;
+}
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 06/11] ethdev: add simple power management API
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (4 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 05/11] eal: add monitor wakeup function Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Anatoly Burakov
                                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Ray Kinsella, Neil Horman, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to allow getting the monitor conditions for
power-optimized monitoring of the Rx queues from the PMD, as well as
release notes information.
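
As a rough sketch of how a monitor condition is meant to be interpreted,
the standalone model below reproduces the mask/value check that
rte_power_monitor() performs before sleeping. It does not use DPDK
headers: `struct monitor_cond`, `read_val` and `cond_already_met` are
hypothetical names, and the field layout is simply copied from
`struct rte_power_monitor_cond` for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of the condition layout; not the DPDK definition. */
struct monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* value compared against the masked contents */
	uint64_t mask;       /* zero means "skip the comparison" */
	uint8_t data_sz;     /* 1, 2, 4 or 8 bytes */
};

/* Read data_sz bytes from p; the caller validates the size first. */
static uint64_t
read_val(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case 1: return *(const volatile uint8_t *)p;
	case 2: return *(const volatile uint16_t *)p;
	case 4: return *(const volatile uint32_t *)p;
	case 8: return *(const volatile uint64_t *)p;
	default:
		assert(0 && "size is validated by the caller");
		return 0;
	}
}

/*
 * Returns 1 if the masked value already matches (sleep would be aborted),
 * 0 if the caller may go to sleep, -1 on invalid parameters.
 */
static int
cond_already_met(const struct monitor_cond *pmc)
{
	if (pmc == NULL)
		return -1;
	if (pmc->data_sz != 1 && pmc->data_sz != 2 &&
			pmc->data_sz != 4 && pmc->data_sz != 8)
		return -1;
	if (pmc->mask == 0)
		return 0;

	return (read_val(pmc->addr, pmc->data_sz) & pmc->mask) == pmc->val;
}
```

In the real API a PMD would fill in the condition from its own Rx
descriptor format; the numbers a driver picks for `val` and `mask` are
driver-specific and are not modelled here.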

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---

Notes:
    v17:
    - Added libabigail ignore for driver-only ABI in ethdev as suggested by David
    
    v13:
    - Fix typos and issues raised by Andrew

 devtools/libabigail.abignore           |  3 +++
 doc/guides/rel_notes/release_21_02.rst |  5 +++++
 lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
 lib/librte_ethdev/version.map          |  3 +++
 6 files changed, 86 insertions(+)

diff --git a/devtools/libabigail.abignore b/devtools/libabigail.abignore
index 025f2c01bc..1c16114dce 100644
--- a/devtools/libabigail.abignore
+++ b/devtools/libabigail.abignore
@@ -7,3 +7,6 @@
         symbol_version = INTERNAL
 [suppress_variable]
         symbol_version = INTERNAL
+; Explicit ignore for driver-only ABI
+[suppress_type]
+        name = eth_dev_ops
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index 706cbf8f0c..ec9958a141 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: added new API for PMD power management**
+
+  * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
+    ``rte_power_monitor()`` to enable automatic power management for PMDs.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 17ddacc78d..e19dbd838b 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (pmc == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n",
+				pmc);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id],
+			pmc));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index f5f8919186..ca0f91312e 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -4334,6 +4335,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Retrieve the monitor condition for a given receive queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param pmc
+ *   Pointer to the power-optimized monitoring condition structure.
+ *
+ * @return
+ *   - 0: Success.
+ *   -ENOTSUP: Operation not supported.
+ *   -EINVAL: Invalid parameters.
+ *   -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index 0eacfd8425..3b3b0ec1a0 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received on an Rx queue.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param pmc
+ *   The pointer to power-optimized monitoring condition structure.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_monitor_addr_t)(void *rxq,
+		struct rte_power_monitor_cond *pmc);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -917,6 +937,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_monitor_addr_t get_monitor_addr;
+	/**< Get power monitoring condition for Rx queue. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index d3f5410806..a124e1e370 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -240,6 +240,9 @@ EXPERIMENTAL {
 	rte_flow_get_restore_info;
 	rte_flow_tunnel_action_decap_release;
 	rte_flow_tunnel_item_release;
+
+	# added in 21.02
+	rte_eth_get_monitor_addr;
 };
 
 INTERNAL {
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (5 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-18 12:41                   ` David Hunt
  2021-01-18 22:48                   ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Thomas Monjalon
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 08/11] net/ixgbe: implement power management API Anatoly Burakov
                                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas,
	konstantin.ananyev, timothy.mcdaniel, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait until
either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design uses PMD RX callbacks and supports three power saving
schemes:

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. TPAUSE/Pause instruction:

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling:

   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.
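
The empty-poll counting shared by all three schemes can be sketched
roughly as below. This is a standalone illustration, not the code from
the patch: the sketch_* names are hypothetical, and only the threshold
value (512, as in EMPTYPOLL_MAX) and the reset-on-traffic behaviour are
taken from the callbacks in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the patch defines EMPTYPOLL_MAX as 512. */
#define SKETCH_EMPTYPOLL_MAX 512

/* Hypothetical per-queue state, standing in for struct pmd_queue_cfg. */
struct sketch_queue_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
	uint64_t sleeps;      /* how many times a power-saving action fired */
};

/*
 * Called after every Rx burst with the number of packets received.
 * Returns true when the caller should take its power-saving action
 * (monitor, pause or scale down, depending on the chosen scheme).
 */
static bool
sketch_rx_poll(struct sketch_queue_state *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic arrived: reset the counter and keep polling */
		q->empty_polls = 0;
		return false;
	}

	q->empty_polls++;
	if (q->empty_polls > SKETCH_EMPTYPOLL_MAX) {
		q->sleeps++;
		return true;
	}
	return false;
}
```

Note that once the threshold is crossed, every further empty poll
triggers the power-saving action again until a non-empty poll resets
the counter.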

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v17:
    - Added memory barriers suggested by Konstantin
    - Removed the BUSY state

 doc/guides/prog_guide/power_man.rst    |  44 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 364 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  90 ++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 516 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..02280dd689 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides
+a convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever the empty poll count reaches a certain number.
+
+  * Monitor
+
+   This power saving scheme will put the CPU into an optimized power state and
+   use the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will avoid busy polling by either entering a
+   power-optimized sleep state with the ``rte_power_pause()`` function or, if
+   that is not available, calling ``rte_pause()``.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing ``librte_power`` library
+   functionality to scale the core frequency up/down depending on traffic
+   volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable a specific power scheme for a certain queue/port/core.
+
+* **Queue Disable**: Disable the power scheme for a certain queue/port/core.
+
 References
 ----------
 
@@ -200,3 +241,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index ec9958a141..9cd8214e2d 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -60,6 +60,16 @@ New Features
   * ``rte_eth_get_monitor_addr()``, to be used in conjunction with
     ``rte_power_monitor()`` to enable automatic power management for PMD's.
 
+* **Add PMD power management helper API**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..51a471b669 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..3dd463d69a
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,364 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+static struct pmd_conf_data {
+	struct rte_cpu_intrinsics intrinsics_support;
+	/**< what do we support? */
+	uint64_t tsc_per_us;
+	/**< pre-calculated tsc diff for 1us */
+	uint64_t pause_per_us;
+	/**< how many rte_pause can we fit in a microsecond? */
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED
+};
+
+struct pmd_queue_cfg {
+	volatile enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	volatile bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us */
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			int ret;
+
+			/*
+			 * we might get a cancellation request while being
+			 * inside the callback, in which case the wakeup
+			 * wouldn't work because it would've arrived too early.
+			 *
+			 * to get around this, we notify the other thread that
+			 * we're sleeping, so that it can spin until we're done.
+			 * unsolicited wakeups are perfectly safe.
+			 */
+			q_conf->umwait_in_progress = true;
+
+			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+			/* check if we need to cancel sleep */
+			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+				/* use monitoring condition to sleep */
+				ret = rte_eth_get_monitor_addr(port_id, qidx,
+						&pmc);
+				if (ret == 0)
+					rte_power_monitor(&pmc, -1ULL);
+			}
+			q_conf->umwait_in_progress = false;
+
+			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
+		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	struct rte_eth_dev_info info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_eth_dev_info_get(port_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if queue id is valid */
+	if (queue_id >= info.nb_rx_queues) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		struct rte_power_monitor_cond dummy;
+
+		/* check if rte_power_monitor is supported */
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->umwait_in_progress = false;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* ensure we update our state before callback starts */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* this is not necessary here, but do it anyway */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* this is not necessary here, but do it anyway */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* stop any callbacks from progressing */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	/* ensure we update our state before continuing */
+	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..0bfbc6ba69
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management scheme to use on this queue.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable power management on a specified RX queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..61996b4d11 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 08/11] net/ixgbe: implement power management API
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (6 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 09/11] net/i40e: " Anatoly Burakov
                                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.
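
For readers unfamiliar with the monitor condition, the sketch below
shows how the fields filled in by `get_monitor_addr` (addr, val, mask,
data_sz) could be evaluated before going to sleep. It is only an
illustration under assumptions: the sketch_* names and field types are
hypothetical stand-ins for struct rte_power_monitor_cond, values are
compared in host byte order, and it assumes the sleep is skipped when
the masked value already equals val (that is, the DD bit is already
set).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_power_monitor_cond. */
struct sketch_monitor_cond {
	const volatile void *addr; /* address to watch */
	uint64_t val;              /* expected value after masking */
	uint64_t mask;             /* bits of interest */
	uint8_t data_sz;           /* width of the watched field, in bytes */
};

/* Read the watched field using the width reported by the driver. */
static uint64_t
sketch_read(const struct sketch_monitor_cond *c)
{
	switch (c->data_sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)c->addr;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)c->addr;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)c->addr;
	default:
		return *(const volatile uint64_t *)c->addr;
	}
}

/* True if the descriptor was already written, so sleeping is pointless. */
static bool
sketch_already_written(const struct sketch_monitor_cond *c)
{
	return (sketch_read(c) & c->mask) == c->val;
}
```

Bits outside the mask are ignored, so other status or error bits in the
same word do not affect the decision.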

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index d7a1806ab8..97acf35d24 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_monitor_addr     = ixgbe_get_monitor_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 7bb8460359..cc8f70e6dd 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,31 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int
+ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	/* the registers are 32-bit */
+	pmc->data_sz = sizeof(uint32_t);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 6d2f7c9da3..8a25e98df6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 09/11] net/i40e: implement power management API
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (7 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 08/11] net/ixgbe: implement power management API Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 10/11] net/ice: " Anatoly Burakov
                                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 25 +++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 14622484a0..ba1abc584f 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_monitor_addr             = i40e_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5df9a9df56..0b4220fc9c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -72,6 +72,31 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	/* registers are 64-bit */
+	pmc->data_sz = sizeof(uint64_t);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..e1494525ce 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 10/11] net/ice: implement power management API
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (8 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 09/11] net/i40e: " Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
                                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev,
	timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_monitor_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 26 ++++++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  1 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 587f485ee3..38c6263946 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_monitor_addr             = ice_get_monitor_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index d052bd0f1b..066651dc48 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int
+ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	pmc->addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	/* register is 16-bit */
+	pmc->data_sz = sizeof(uint16_t);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 6b16716063..906fbefdc4 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts,
 				  uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v17 11/11] examples/l3fwd-power: enable PMD power mgmt
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (9 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 10/11] net/ice: " Anatoly Burakov
@ 2021-01-14 14:46                 ` Anatoly Burakov
  2021-01-18 12:53                   ` David Hunt
  2021-01-18 15:24                 ` [dpdk-dev] [PATCH v17 00/11] Add PMD power management David Marchand
  2021-01-18 22:52                 ` Thomas Monjalon
  12 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev,
	timothy.mcdaniel, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v12:
    - Allow selecting PMD power management scheme from command-line
    - Enforce 1 core 1 queue rule

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 89 ++++++++++++++++++-
 2 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables the power saving scheme on specific
+port/queue/lcore combinations. Its main purpose is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..e312b6f355 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2824,6 +2895,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.25.1
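
For readers following the option parsing above, the mode-name lookup can
also be written as a table scan. This is only a standalone sketch: the
enum and function names below are hypothetical stand-ins for
enum rte_power_pmd_mgmt_type and parse_pmd_mgmt_config(), and exact
strcmp() matching is used, which should behave the same as the
strncmp()/sizeof() comparison in the patch because sizeof() includes the
terminating NUL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for enum rte_power_pmd_mgmt_type from the patch. */
enum pmd_mgmt_type {
	PMD_MGMT_TYPE_MONITOR,
	PMD_MGMT_TYPE_PAUSE,
	PMD_MGMT_TYPE_SCALE
};

/* Map a mode name to its scheme; returns 0 on success, -1 if unknown. */
static int
parse_pmd_mgmt_mode(const char *name, enum pmd_mgmt_type *out)
{
	static const struct {
		const char *name;
		enum pmd_mgmt_type type;
	} modes[] = {
		{ "monitor", PMD_MGMT_TYPE_MONITOR },
		{ "pause",   PMD_MGMT_TYPE_PAUSE },
		{ "scale",   PMD_MGMT_TYPE_SCALE },
	};
	size_t i;

	if (name == NULL || out == NULL)
		return -1;

	/* Exact match only: "scale2" or "sca" are rejected. */
	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
		if (strcmp(modes[i].name, name) == 0) {
			*out = modes[i].type;
			return 0;
		}
	}
	return -1;
}
```

With this layout, adding a fourth scheme would only need one more table
row.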

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-18 12:41                   ` David Hunt
  2021-01-19 16:45                     ` [dpdk-dev] [PATCH v18 0/2] Add PMD power management Anatoly Burakov
  2021-01-18 22:48                   ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Thomas Monjalon
  1 sibling, 1 reply; 421+ messages in thread
From: David Hunt @ 2021-01-18 12:41 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, Ray Kinsella, Neil Horman, thomas, konstantin.ananyev,
	timothy.mcdaniel, bruce.richardson, chris.macnamara

Hi Anatoly,

On 14/1/2021 2:46 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
>
> This API mandates a core-to-single-queue mapping (that is, multiple
> queues per device are supported, but they have to be polled on different
> cores).
>
> This design is using PMD RX callbacks.
>
> 1. UMWAIT/UMONITOR:
>
>     When a certain threshold of empty polls is reached, the core will go
>     into a power optimized sleep while waiting on an address of next RX
>     descriptor to be written to.
>
> 2. TPAUSE/Pause instruction
>
>     This method uses the pause (or TPAUSE, if available) instruction to
>     avoid busy polling.
>
> 3. Frequency scaling
>     Reuse existing DPDK power library to scale up/down core frequency
>     depending on traffic volume.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
>      v17:
>      - Added memory barriers suggested by Konstantin
>      - Removed the BUSY state
>
>   doc/guides/prog_guide/power_man.rst    |  44 +++
>   doc/guides/rel_notes/release_21_02.rst |  10 +
>   lib/librte_power/meson.build           |   5 +-
>   lib/librte_power/rte_power_pmd_mgmt.c  | 364 +++++++++++++++++++++++++
>   lib/librte_power/rte_power_pmd_mgmt.h  |  90 ++++++
>   lib/librte_power/version.map           |   5 +
>   6 files changed, 516 insertions(+), 2 deletions(-)
>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
>

Acked-by: David Hunt <david.hunt@intel.com>



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 11/11] examples/l3fwd-power: enable PMD power mgmt
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-18 12:53                   ` David Hunt
  0 siblings, 0 replies; 421+ messages in thread
From: David Hunt @ 2021-01-18 12:53 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, thomas, konstantin.ananyev, timothy.mcdaniel,
	bruce.richardson, chris.macnamara

Hi Anatoly,

On 14/1/2021 2:46 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add PMD power management feature support to l3fwd-power sample app.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
>      v12:
>      - Allow selecting PMD power management scheme from command-line
>      - Enforce 1 core 1 queue rule
>

Acked-by: David Hunt <david.hunt@intel.com>


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 00/11] Add PMD power management
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (10 preceding siblings ...)
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-18 15:24                 ` David Marchand
  2021-01-18 15:45                   ` Burakov, Anatoly
  2021-01-18 22:52                 ` Thomas Monjalon
  12 siblings, 1 reply; 421+ messages in thread
From: David Marchand @ 2021-01-18 15:24 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Timothy McDaniel,
	David Hunt, Bruce Richardson, chris.macnamara

On Thu, Jan 14, 2021 at 3:46 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers to cause the
> CPU to enter a power-optimized state while waiting for packets to
> arrive. There are multiple proposed mechanisms to achieve said power
> savings: simple frequency scaling, idle loop, and monitoring the Rx
> queue for incoming packets. The latter is achieved through cooperation
> with the NIC driver that will allow us to know address of wake up event,
> and wait for writes on that address.
>
> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
> are used in their raw opcode form because there is no widespread
> compiler support for them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen to implement
> similar instructions.
>
> To achieve power savings, there is a very simple mechanism used: we're
> counting empty polls, and if a certain threshold is reached, we employ
> one of the suggested power management schemes automatically, from within
> a Rx callback inside the PMD. Once there's traffic again, the empty poll
> counter is reset.
>
> This patchset also introduces a few changes into existing power
> management-related intrinsics, namely to provide a native way of waking
> up a sleeping core without application being responsible for it, as well
> as general robustness improvements. There's quite a bit of locking going
> on, but these locks are per-thread and very little (if any) contention
> is expected, so the performance impact shouldn't be that bad (and in any
> case the locking happens when we're about to sleep anyway).
>
> Why are we putting it into ethdev as opposed to leaving this up to the
> application? Our customers specifically requested a way to do it with
> minimal changes to the application code. The current approach allows to
> just flip a switch and automatically have power savings.
>
> Things of note:
>
> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>   must at most handle RX on a single queue
> - Support 3 type policies. Monitor/Pause/Frequency Scaling
> - Power management is enabled per-queue
> - The API doesn't extend to other device types
>
> v17:
> - Added exception for ethdev driver-only ABI
> - Added memory barriers for monitor/wakeup (Konstantin)
> - Fixed compile issues on non-x86 platforms (hopefully!)

SPDK build is still broken.
http://mails.dpdk.org/archives/test-report/2021-January/174840.html

==== 20 line log output for Ubuntu 18.04 (dpdk_compile_spdk): ====
rte_power_pmd_mgmt.c:(.text.experimental+0x1cc): undefined reference
to `rte_eth_add_rx_callback'
rte_power_pmd_mgmt.c:(.text.experimental+0x1f8): undefined reference
to `rte_eth_get_monitor_addr'
rte_power_pmd_mgmt.c:(.text.experimental+0x37f): undefined reference
to `rte_eth_dev_logtype'
/dpdk/build/lib/librte_power.a(librte_power_rte_power_pmd_mgmt.c.o):
In function `rte_power_pmd_mgmt_queue_disable':
rte_power_pmd_mgmt.c:(.text.experimental+0x42a): undefined reference
to `rte_eth_dev_is_valid_port'
rte_power_pmd_mgmt.c:(.text.experimental+0x4e7): undefined reference
to `rte_eth_remove_rx_callback'
rte_power_pmd_mgmt.c:(.text.experimental+0x536): undefined reference
to `rte_eth_remove_rx_callback'
rte_power_pmd_mgmt.c:(.text.experimental+0x54d): undefined reference
to `rte_eth_dev_logtype'
collect2: error: ld returned 1 exit status
/spdk/mk/spdk.app.mk:65: recipe for target 'iscsi_fuzz' failed
/spdk/mk/spdk.subdirs.mk:44: recipe for target 'iscsi_fuzz' failed
/spdk/mk/spdk.subdirs.mk:44: recipe for target 'fuzz' failed
make[4]: *** [iscsi_fuzz] Error 1
make[3]: *** [iscsi_fuzz] Error 2
make[2]: *** [fuzz] Error 2
/spdk/mk/spdk.subdirs.mk:44: recipe for target 'app' failed
make[1]: *** [app] Error 2
/spdk/mk/spdk.subdirs.mk:44: recipe for target 'test' failed
make: *** [test] Error 2
[2] Error running command.


I guess this is because rte_power now depends on rte_ethdev.
Afaics, SPDK does not use pkg-config:
https://github.com/spdk/spdk/blob/master/lib/env_dpdk/env.mk#L53


-- 
David Marchand


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 00/11] Add PMD power management
  2021-01-18 15:24                 ` [dpdk-dev] [PATCH v17 00/11] Add PMD power management David Marchand
@ 2021-01-18 15:45                   ` Burakov, Anatoly
  2021-01-18 16:06                     ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-18 15:45 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Timothy McDaniel,
	David Hunt, Bruce Richardson, chris.macnamara

On 18-Jan-21 3:24 PM, David Marchand wrote:
> On Thu, Jan 14, 2021 at 3:46 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> This patchset proposes a simple API for Ethernet drivers to cause the
>> CPU to enter a power-optimized state while waiting for packets to
>> arrive. There are multiple proposed mechanisms to achieve said power
>> savings: simple frequency scaling, idle loop, and monitoring the Rx
>> queue for incoming packets. The latter is achieved through cooperation
>> with the NIC driver that will allow us to know address of wake up event,
>> and wait for writes on that address.
>>
>> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
>> are used in their raw opcode form because there is no widespread
>> compiler support for them yet. Still, the API is made generic enough to
>> hopefully support other architectures, if they happen to implement
>> similar instructions.
>>
>> To achieve power savings, there is a very simple mechanism used: we're
>> counting empty polls, and if a certain threshold is reached, we employ
>> one of the suggested power management schemes automatically, from within
>> a Rx callback inside the PMD. Once there's traffic again, the empty poll
>> counter is reset.
>>
>> This patchset also introduces a few changes into existing power
>> management-related intrinsics, namely to provide a native way of waking
>> up a sleeping core without application being responsible for it, as well
>> as general robustness improvements. There's quite a bit of locking going
>> on, but these locks are per-thread and very little (if any) contention
>> is expected, so the performance impact shouldn't be that bad (and in any
>> case the locking happens when we're about to sleep anyway).
>>
>> Why are we putting it into ethdev as opposed to leaving this up to the
>> application? Our customers specifically requested a way to do it with
>> minimal changes to the application code. The current approach allows to
>> just flip a switch and automatically have power savings.
>>
>> Things of note:
>>
>> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>>    must at most handle RX on a single queue
>> - Support 3 type policies. Monitor/Pause/Frequency Scaling
>> - Power management is enabled per-queue
>> - The API doesn't extend to other device types
>>
>> v17:
>> - Added exception for ethdev driver-only ABI
>> - Added memory barriers for monitor/wakeup (Konstantin)
>> - Fixed compile issues on non-x86 platforms (hopefully!)
> 
> SPDK build is still broken.
> http://mails.dpdk.org/archives/test-report/2021-January/174840.html
> 
> ==== 20 line log output for Ubuntu 18.04 (dpdk_compile_spdk): ====
> rte_power_pmd_mgmt.c:(.text.experimental+0x1cc): undefined reference
> to `rte_eth_add_rx_callback'
> rte_power_pmd_mgmt.c:(.text.experimental+0x1f8): undefined reference
> to `rte_eth_get_monitor_addr'
> rte_power_pmd_mgmt.c:(.text.experimental+0x37f): undefined reference
> to `rte_eth_dev_logtype'
> /dpdk/build/lib/librte_power.a(librte_power_rte_power_pmd_mgmt.c.o):
> In function `rte_power_pmd_mgmt_queue_disable':
> rte_power_pmd_mgmt.c:(.text.experimental+0x42a): undefined reference
> to `rte_eth_dev_is_valid_port'
> rte_power_pmd_mgmt.c:(.text.experimental+0x4e7): undefined reference
> to `rte_eth_remove_rx_callback'
> rte_power_pmd_mgmt.c:(.text.experimental+0x536): undefined reference
> to `rte_eth_remove_rx_callback'
> rte_power_pmd_mgmt.c:(.text.experimental+0x54d): undefined reference
> to `rte_eth_dev_logtype'
> collect2: error: ld returned 1 exit status
> /spdk/mk/spdk.app.mk:65: recipe for target 'iscsi_fuzz' failed
> /spdk/mk/spdk.subdirs.mk:44: recipe for target 'iscsi_fuzz' failed
> /spdk/mk/spdk.subdirs.mk:44: recipe for target 'fuzz' failed
> make[4]: *** [iscsi_fuzz] Error 1
> make[3]: *** [iscsi_fuzz] Error 2
> make[2]: *** [fuzz] Error 2
> /spdk/mk/spdk.subdirs.mk:44: recipe for target 'app' failed
> make[1]: *** [app] Error 2
> /spdk/mk/spdk.subdirs.mk:44: recipe for target 'test' failed
> make: *** [test] Error 2
> [2] Error running command.
> 
> 
> I guess this is because of the added dependency of rte_ethdev to rte_power.
> Afaics, SPDK does not use pkg-config:
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/env.mk#L53
> 
> 

Sooo... this is an SPDK issue then? Because I can't see any way of 
fixing the issue on the DPDK side.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 00/11] Add PMD power management
  2021-01-18 15:45                   ` Burakov, Anatoly
@ 2021-01-18 16:06                     ` Thomas Monjalon
  2021-01-18 17:02                       ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-18 16:06 UTC (permalink / raw)
  To: David Marchand, Burakov, Anatoly, David Hunt, chris.macnamara
  Cc: dev, Ananyev, Konstantin, Timothy McDaniel, Bruce Richardson,
	andrew.rybchenko, ferruh.yigit, ajit.khaparde, jerinj

18/01/2021 16:45, Burakov, Anatoly:
> On 18-Jan-21 3:24 PM, David Marchand wrote:
> > On Thu, Jan 14, 2021 at 3:46 PM Anatoly Burakov
> > <anatoly.burakov@intel.com> wrote:
> >>
> >> This patchset proposes a simple API for Ethernet drivers to cause the
> >> CPU to enter a power-optimized state while waiting for packets to
> >> arrive. There are multiple proposed mechanisms to achieve said power
> >> savings: simple frequency scaling, idle loop, and monitoring the Rx
> >> queue for incoming packets. The latter is achieved through cooperation
> >> with the NIC driver that will allow us to know address of wake up event,
> >> and wait for writes on that address.
[...]
> >> Why are we putting it into ethdev as opposed to leaving this up to the
> >> application? Our customers specifically requested a way to do it with
> >> minimal changes to the application code. The current approach allows to
> >> just flip a switch and automatically have power savings.

Customer laziness is usually a bad justification :)
I think we could achieve the same with not too much code
on the application side.
And I'm not sure hiding queue management is sane.
Remember this rule: application must remain in control.

[...]
> > SPDK build is still broken.
> > http://mails.dpdk.org/archives/test-report/2021-January/174840.html
[...]
> > I guess this is because of the added dependency of rte_ethdev to rte_power.
> > Afaics, SPDK does not use pkg-config:
> > https://github.com/spdk/spdk/blob/master/lib/env_dpdk/env.mk#L53
> 
> Sooo... this is an SPDK issue then? Because i can't see any way of 
> fixing the issue on DPDK side.

Yes SPDK should not skip pkg-config.
But it raises 2 questions:
	- are we breaking ABI compatibility?
	- is ethdev management expected for librte_power?

It makes me wonder whether we should host the few functions mixing
librte_ethdev and librte_power somewhere else.
The question is where?



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 00/11] Add PMD power management
  2021-01-18 16:06                     ` Thomas Monjalon
@ 2021-01-18 17:02                       ` Burakov, Anatoly
  2021-01-18 17:54                         ` David Marchand
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-18 17:02 UTC (permalink / raw)
  To: Thomas Monjalon, David Marchand, David Hunt, chris.macnamara
  Cc: dev, Ananyev, Konstantin, Timothy McDaniel, Bruce Richardson,
	andrew.rybchenko, ferruh.yigit, ajit.khaparde, jerinj

On 18-Jan-21 4:06 PM, Thomas Monjalon wrote:
> 18/01/2021 16:45, Burakov, Anatoly:
>> On 18-Jan-21 3:24 PM, David Marchand wrote:
>>> On Thu, Jan 14, 2021 at 3:46 PM Anatoly Burakov
>>> <anatoly.burakov@intel.com> wrote:
>>>>
>>>> This patchset proposes a simple API for Ethernet drivers to cause the
>>>> CPU to enter a power-optimized state while waiting for packets to
>>>> arrive. There are multiple proposed mechanisms to achieve said power
>>>> savings: simple frequency scaling, idle loop, and monitoring the Rx
>>>> queue for incoming packets. The latter is achieved through cooperation
>>>> with the NIC driver that will allow us to know address of wake up event,
>>>> and wait for writes on that address.
> [...]
>>>> Why are we putting it into ethdev as opposed to leaving this up to the
>>>> application? Our customers specifically requested a way to do it with
>>>> minimal changes to the application code. The current approach allows to
>>>> just flip a switch and automatically have power savings.
> 
> The customer laziness is usually a bad justification :)
> I think we could achieve the same with not too much code
> on application side.

Yes, we could. Customers could basically take this patch and reimplement 
it inside their application, and get the same benefits (with the added 
benefit of knowing their own queue/core mapping, and so being 
able to use the PAUSE or SCALE schemes for more than one queue).

However, I still think it's a valid use case - if we can do it that way 
and have a ready-made power management story, why not?
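
To make the application-side alternative concrete, here is a minimal
sketch of the empty-poll counting described in the cover letter, done in
the application's own RX loop. The struct and function names are
hypothetical; only the EMPTYPOLL_MAX value of 512 comes from the patch,
and the power-saving action itself (monitor, pause or scale) is left to
the caller.

```c
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* threshold value taken from the patch */

/* Hypothetical per-queue state kept by the application. */
struct poll_state {
	uint64_t empty_polls;
};

/*
 * Update the counter after one RX burst of nb_rx packets.
 * Returns 1 when the caller should invoke its power-saving action,
 * 0 otherwise. Any received packet resets the counter.
 */
static int
poll_state_update(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		st->empty_polls = 0;
		return 0;
	}
	st->empty_polls++;
	return st->empty_polls > EMPTYPOLL_MAX;
}
```

An application would call this after every RX burst and, on a return of
1, apply whichever scheme fits its own queue/core mapping.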

> And I'm not sure hiding queue management is sane.
> Remember this rule: application must remain in control.
>

The application can still be in control by just not using the API and 
implementing things manually instead. Nothing is being taken away from 
the ability of application to be in control.

> [...]
>>> SPDK build is still broken.
>>> http://mails.dpdk.org/archives/test-report/2021-January/174840.html
> [...]
>>> I guess this is because of the added dependency of rte_ethdev to rte_power.
>>> Afaics, SPDK does not use pkg-config:
>>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/env.mk#L53
>>
>> Sooo... this is an SPDK issue then? Because i can't see any way of
>> fixing the issue on DPDK side.
> 
> Yes SPDK should not skip pkg-config.
> But it raises 2 question:
> 	- are we breaking ABI compatibility?

Good question. Does including an extra intra-DPDK dependency count as an 
ABI break? I was under the impression that we didn't want DPDK to be 
distributed as individual libraries but rather would like it to be used 
as a whole, so if internal dependencies between components change, it's 
not a big deal (unless a third-party build system is used that 
explicitly specifies dependencies rather than using pkg-config).

> 	- is ethdev management expected for librte_power?
> 
> It makes me wonder whether we should host the few functions mixing
> librte_ethdev and librte_power somewhere else.
> The question is where?
> 

That could be another possibility. We could put this into a separate 
library, but IMO it would serve no purpose other than avoiding adding a 
dependency on an *internal* component to librte_power. I'm not sure it's a 
worthwhile trade-off.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 00/11] Add PMD power management
  2021-01-18 17:02                       ` Burakov, Anatoly
@ 2021-01-18 17:54                         ` David Marchand
  0 siblings, 0 replies; 421+ messages in thread
From: David Marchand @ 2021-01-18 17:54 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, David Hunt, chris.macnamara, dev, Ananyev,
	Konstantin, Timothy McDaniel, Bruce Richardson, Andrew Rybchenko,
	Yigit, Ferruh, Ajit Khaparde, Jerin Jacob Kollanukkaran

On Mon, Jan 18, 2021 at 6:02 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
> >>> SPDK build is still broken.
> >>> http://mails.dpdk.org/archives/test-report/2021-January/174840.html
> > [...]
> >>> I guess this is because of the added dependency of rte_ethdev to rte_power.
> >>> Afaics, SPDK does not use pkg-config:
> >>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/env.mk#L53
> >>
> >> Sooo... this is an SPDK issue then? Because i can't see any way of
> >> fixing the issue on DPDK side.
> >
> > Yes SPDK should not skip pkg-config.
> > But it raises 2 question:
> >       - are we breaking ABI compatibility?
>
> Good question. Does including an extra intra-DPDK dependency count as
> ABI break? I was under impression that we didn't want DPDK to be
> distributed as individual libraries but rather would like it to be used
> as a whole, so if internal dependencies between components change, it's
> not a big deal (unless a third-party build system is used that
> explicitly specifies dependencies rather than using pkg-config).

I don't get where an ABI breakage would be.

What I reported is an issue with static link.

For shared link, I would expect librte_power would expose its
dependency on rte_ethdev via a DT_NEEDED entry.
The final binary does not have to be aware of it.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-18 22:26                   ` Thomas Monjalon
  2021-01-19 10:29                     ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-18 22:26 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara

14/01/2021 15:46, Anatoly Burakov:
> +struct rte_power_monitor_cond {
> +	volatile void *addr;  /**< Address to monitor for changes */
> +	uint64_t val;         /**< Before attempting the monitoring, the address
> +	                       *   may be read and compared against this value.

"may" be read and compared?
Is there a case where there is no read and compare?

> +	                       **/
> +	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
> +	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
> +	                  *   expected value with the memory address. Can be 1,
> +	                  *   2, 4, or 8. Supplying any other value will lead to
> +	                  *   undefined result. */

Other parameters are not prefixed with "data_",
so I think this field could be simply named "size".

> +};

I understand this struct is a direct translation of what existed
in 20.11 as function parameters and comments.
If you agree, these comments could be addressed in a separate patch.
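
To illustrate the "may be read and compared" wording, below is a sketch
of one plausible reading: the comparison is skipped when the mask is
zero, and otherwise the masked value at addr is compared against val
before sleeping. This is an assumption for illustration, not a statement
of what the implementation does. The struct is a hypothetical mirror of
the quoted one, and the helper names are made up.

```c
#include <stdint.h>

/* Hypothetical mirror of the quoted struct; not the DPDK definition. */
struct monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* bits of interest; 0 means "no check" here */
	uint8_t data_sz;     /* 1, 2, 4 or 8 bytes */
};

/* Read data_sz bytes from addr as an unsigned value. */
static uint64_t
read_sized(const volatile void *addr, uint8_t data_sz)
{
	switch (data_sz) {
	case 1: return *(const volatile uint8_t *)addr;
	case 2: return *(const volatile uint16_t *)addr;
	case 4: return *(const volatile uint32_t *)addr;
	case 8: return *(const volatile uint64_t *)addr;
	default: return 0; /* the quoted comment calls this case undefined */
	}
}

/*
 * Returns 1 if the expected value is already present (so sleeping
 * would be pointless), 0 otherwise.
 * Assumption: a zero mask disables the read-and-compare step.
 */
static int
should_skip_sleep(const struct monitor_cond *c)
{
	if (c->mask == 0)
		return 0;
	return (read_sized(c->addr, c->data_sz) & c->mask) == c->val;
}
```

Under this reading, the answer to "is there a case where there is no
read and compare?" would be the zero-mask case.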



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback
  2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Anatoly Burakov
  2021-01-18 12:41                   ` David Hunt
@ 2021-01-18 22:48                   ` Thomas Monjalon
  2021-01-19 12:25                     ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-18 22:48 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Liang Ma, David Hunt, Ray Kinsella, Neil Horman,
	konstantin.ananyev, timothy.mcdaniel, bruce.richardson,
	chris.macnamara

14/01/2021 15:46, Anatoly Burakov:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> +   Currently, this power management API is limited to mandatory mapping of 1
> +   queue to 1 core (multiple queues are supported, but they must be polled from
> +   different cores).

This is quite limited.
Not sure librte_power is the right place for a flexible ethdev management.

> +
> +API Overview for PMD Power Management
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Underlining should be shorter please.

> +* **Queue Enable**: Enable specific power scheme for certain queue/port/core
> +
> +* **Queue Disable**: Disable power scheme for certain queue/port/core

Please terminate sentences with a dot.

> +
>  References
>  ----------
>  
> @@ -200,3 +241,6 @@ References
>  
>  *   The :doc:`../sample_app_ug/vm_power_management`
>      chapter in the :doc:`../sample_app_ug/index` section.
> +
> +*   The :doc:`../sample_app_ug/rxtx_callbacks`
> +    chapter in the :doc:`../sample_app_ug/index` section.

Why is the index page mentioned here?


> --- a/doc/guides/rel_notes/release_21_02.rst
> +++ b/doc/guides/rel_notes/release_21_02.rst
> +* **Add PMD power management helper API**

Please follow release notes guidelines (past tense and dot).

> +
> +  A new helper API has been added to make using Ethernet PMD power management
> +  easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. Three power
> +  management schemes are supported initially:
> +
> +  * Power saving based on UMWAIT instruction (x86 only)
> +  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
> +  * Power saving based on frequency scaling through the ``librte_power`` library
[...]
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> -deps += ['timer']
> +deps += ['timer' ,'ethdev']

Wrapping ethdev looks very strange to me.


> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> @@ -0,0 +1,364 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation

2010?

> + */
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_cpuflags.h>
> +#include <rte_malloc.h>
> +#include <rte_ethdev.h>
> +#include <rte_power_intrinsics.h>
> +
> +#include "rte_power_pmd_mgmt.h"
> +
> +#define EMPTYPOLL_MAX  512
> +
> +static struct pmd_conf_data {
> +	struct rte_cpu_intrinsics intrinsics_support;
> +	/**< what do we support? */
> +	uint64_t tsc_per_us;
> +	/**< pre-calculated tsc diff for 1us */
> +	uint64_t pause_per_us;
> +	/**< how many rte_pause can we fit in a microisecond? */

Vim typo spotted: microisecond

> +} global_data;

Not sure about the need for a struct.
Please insert comment before the field if not on the same line.
BTW, why doxygen syntax in a .c file?


> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.h
> @@ -0,0 +1,90 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_PMD_MGMT_H
> +#define _RTE_POWER_PMD_MGMT_H
> +
> +/**
> + * @file
> + * RTE PMD Power Management
> + */

blank line here?

> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_power.h>
> +#include <rte_atomic.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * PMD Power Management Type
> + */
> +enum rte_power_pmd_mgmt_type {
> +	/** Use power-optimized monitoring to wait for incoming traffic */
> +	RTE_POWER_MGMT_TYPE_MONITOR = 1,
> +	/** Use power-optimized sleep to avoid busy polling */
> +	RTE_POWER_MGMT_TYPE_PAUSE,
> +	/** Use frequency scaling when traffic is low */
> +	RTE_POWER_MGMT_TYPE_SCALE,
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice

Dot at the end please.

> + *
> + * Enable power management on a specified RX queue and lcore.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.

Interesting comment :)

> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @param mode
> + *   The power management callback function type.
> +
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id,
> +		enum rte_power_pmd_mgmt_type mode);

In reality it is an API for ethdev Rx queue, not general PMD.
The function should be renamed accordingly.

[...]
> --- a/lib/librte_power/version.map
> +++ b/lib/librte_power/version.map
> +	# added in 21.02
> +	rte_power_pmd_mgmt_queue_enable;
> +	rte_power_pmd_mgmt_queue_disable;

Alpha sort please.




^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 00/11] Add PMD power management
  2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
                                   ` (11 preceding siblings ...)
  2021-01-18 15:24                 ` [dpdk-dev] [PATCH v17 00/11] Add PMD power management David Marchand
@ 2021-01-18 22:52                 ` Thomas Monjalon
  2021-01-19 10:30                   ` Burakov, Anatoly
  12 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-18 22:52 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, konstantin.ananyev, timothy.mcdaniel, david.hunt,
	bruce.richardson, chris.macnamara, david.marchand

14/01/2021 15:46, Anatoly Burakov:
> Anatoly Burakov (5):
>   eal: uninline power intrinsics
>   eal: avoid invalid API usage in power intrinsics
>   eal: change API of power intrinsics
>   eal: remove sync version of power monitor
>   eal: add monitor wakeup function
> 
> Liang Ma (6):
>   ethdev: add simple power management API
>   power: add PMD power management API and callback
>   net/ixgbe: implement power management API
>   net/i40e: implement power management API
>   net/ice: implement power management API
>   examples/l3fwd-power: enable PMD power mgmt

The librte_power part may deserve another iteration.
I suggest to give it a chance for a better version in -rc2.

For 21.02-rc1, the EAL and ethdev (including PMDs) parts are merged, thanks.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-18 22:26                   ` Thomas Monjalon
@ 2021-01-19 10:29                     ` Burakov, Anatoly
  2021-01-19 10:42                       ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-19 10:29 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara

On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
> 14/01/2021 15:46, Anatoly Burakov:
>> +struct rte_power_monitor_cond {
>> +	volatile void *addr;  /**< Address to monitor for changes */
>> +	uint64_t val;         /**< Before attempting the monitoring, the address
>> +	                       *   may be read and compared against this value.
> 
> "may" be read and compared?
> Is there a case where there is no read and compare?

Yes, if the mask is not set.

> 
>> +	                       **/
>> +	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
>> +	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
>> +	                  *   expected value with the memory address. Can be 1,
>> +	                  *   2, 4, or 8. Supplying any other value will lead to
>> +	                  *   undefined result. */
> 
> Other parameters are not prefixed with "data_",
> so I think this field could be simply named "size".

OK.

> 
>> +};
> 
> I understand this struct is a direct translation of what existed
> in 20.11 as function parameters and comments.
> If you agree, these comments could be addressed in a separate patch.
> 

I'll be respinning anyway, so might as well do some quick fixups.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 00/11] Add PMD power management
  2021-01-18 22:52                 ` Thomas Monjalon
@ 2021-01-19 10:30                   ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-19 10:30 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, konstantin.ananyev, timothy.mcdaniel, david.hunt,
	bruce.richardson, chris.macnamara, david.marchand

On 18-Jan-21 10:52 PM, Thomas Monjalon wrote:
> 14/01/2021 15:46, Anatoly Burakov:
>> Anatoly Burakov (5):
>>    eal: uninline power intrinsics
>>    eal: avoid invalid API usage in power intrinsics
>>    eal: change API of power intrinsics
>>    eal: remove sync version of power monitor
>>    eal: add monitor wakeup function
>>
>> Liang Ma (6):
>>    ethdev: add simple power management API
>>    power: add PMD power management API and callback
>>    net/ixgbe: implement power management API
>>    net/i40e: implement power management API
>>    net/ice: implement power management API
>>    examples/l3fwd-power: enable PMD power mgmt
> 
> The librte_power part may deserve another iteration.
> I suggest to give it a chance for a better version in -rc2.
> 
> For 21.02-rc1, the EAL and ethdev (including PMDs) parts are merged, thanks.
> 

Good to hear, thanks!

I'll get on the respin of remaining parts.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-19 10:29                     ` Burakov, Anatoly
@ 2021-01-19 10:42                       ` Thomas Monjalon
  2021-01-19 11:23                         ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-19 10:42 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara

19/01/2021 11:29, Burakov, Anatoly:
> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
> > 14/01/2021 15:46, Anatoly Burakov:
> >> +struct rte_power_monitor_cond {
> >> +	volatile void *addr;  /**< Address to monitor for changes */
> >> +	uint64_t val;         /**< Before attempting the monitoring, the address
> >> +	                       *   may be read and compared against this value.
> > 
> > "may" be read and compared?
> > Is there a case where there is no read and compare?
> 
> Yes, if the mask is not set.

If the mask is not set, the address is "read" anyway
or it is only "watched" for any change?

Sorry the mechanism is really not clear to me.




^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-19 10:42                       ` Thomas Monjalon
@ 2021-01-19 11:23                         ` Burakov, Anatoly
  2021-01-19 14:17                           ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-19 11:23 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara

On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
> 19/01/2021 11:29, Burakov, Anatoly:
>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
>>> 14/01/2021 15:46, Anatoly Burakov:
>>>> +struct rte_power_monitor_cond {
>>>> +	volatile void *addr;  /**< Address to monitor for changes */
>>>> +	uint64_t val;         /**< Before attempting the monitoring, the address
>>>> +	                       *   may be read and compared against this value.
>>>
>>> "may" be read and compared?
>>> Is there a case where there is no read and compare?
>>
>> Yes, if the mask is not set.
> 
> If the mask is not set, the address is "read" anyway
> or it is only "watched" for any change?
> 
> Sorry the mechanism is really not clear to me.
> 

The "value" is only used to avoid the sleep, i.e. to check if the write 
has already happened. We're waiting on *a write* rather than *a value*, 
so it's not equivalent to "wait until equal" call. It's more of a "sleep 
until something happens".
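
To make the distinction concrete, here is a minimal sketch of the check
that happens before sleeping. It is not the actual EAL code: the struct
and helper names (monitor_cond_sketch, read_sized, write_already_seen)
are made up for illustration, and it assumes that a zero mask means no
read-and-compare is done at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct quoted above; not the DPDK definition. */
struct monitor_cond_sketch {
	volatile void *addr; /* address that would be monitored */
	uint64_t val;        /* value meaning "the write already happened" */
	uint64_t mask;       /* assumed: 0 means skip the read-and-compare */
	uint8_t size;        /* 1, 2, 4 or 8 bytes */
};

/* Read 'size' bytes from p; unsupported sizes yield 0 in this sketch. */
static uint64_t
read_sized(const volatile void *p, uint8_t size)
{
	switch (size) {
	case 1: return *(const volatile uint8_t *)p;
	case 2: return *(const volatile uint16_t *)p;
	case 4: return *(const volatile uint32_t *)p;
	case 8: return *(const volatile uint64_t *)p;
	default: return 0;
	}
}

/* Returns true if the write was already observed, i.e. do not sleep. */
static bool
write_already_seen(const struct monitor_cond_sketch *c)
{
	if (c->mask == 0)
		return false; /* no compare requested: arm and sleep */
	return (read_sized(c->addr, c->size) & c->mask) == c->val;
}
```

So the value never defines what we wake up on; it only decides whether
we bother going to sleep in the first place.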

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback
  2021-01-18 22:48                   ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Thomas Monjalon
@ 2021-01-19 12:25                     ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-19 12:25 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Liang Ma, David Hunt, Ray Kinsella, Neil Horman,
	konstantin.ananyev, timothy.mcdaniel, bruce.richardson,
	chris.macnamara

On 18-Jan-21 10:48 PM, Thomas Monjalon wrote:
> 14/01/2021 15:46, Anatoly Burakov:
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> +   Currently, this power management API is limited to mandatory mapping of 1
>> +   queue to 1 core (multiple queues are supported, but they must be polled from
>> +   different cores).
> 
> This is quite limited.
> Not sure librte_power is the right place for a flexible ethdev management.
> 

It's not really "managing" ethdev as such, it just installs a callback. 
You could say it's building on what's available in ethdev, but aside 
from installing a callback it doesn't do anything else.

<snip>

>> +static struct pmd_conf_data {
>> +	struct rte_cpu_intrinsics intrinsics_support;
>> +	/**< what do we support? */
>> +	uint64_t tsc_per_us;
>> +	/**< pre-calculated tsc diff for 1us */
>> +	uint64_t pause_per_us;
>> +	/**< how many rte_pause can we fit in a microisecond? */
> 
> Vim typo spotted: microisecond
> 
>> +} global_data;
> 
> Not sure about the need for a struct.
> Please insert comment before the field if not on the same line.
> BTW, why doxygen syntax in a .c file?
> 

The struct was really there to make autocomplete easier. I can make all 
of the variables static and pull them out if that's necessary, but i 
don't think it makes much difference.

(the rest of the comments will be implemented)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-19 11:23                         ` Burakov, Anatoly
@ 2021-01-19 14:17                           ` Thomas Monjalon
  2021-01-20 10:32                             ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-19 14:17 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara

19/01/2021 12:23, Burakov, Anatoly:
> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
> > 19/01/2021 11:29, Burakov, Anatoly:
> >> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
> >>> 14/01/2021 15:46, Anatoly Burakov:
> >>>> +struct rte_power_monitor_cond {
> >>>> +	volatile void *addr;  /**< Address to monitor for changes */
> >>>> +	uint64_t val;         /**< Before attempting the monitoring, the address
> >>>> +	                       *   may be read and compared against this value.
> >>>
> >>> "may" be read and compared?
> >>> Is there a case where there is no read and compare?
> >>
> >> Yes, if the mask is not set.
> > 
> > If the mask is not set, the address is "read" anyway
> > or it is only "watched" for any change?
> > 
> > Sorry the mechanism is really not clear to me.
> > 
> 
> The "value" is only used to avoid the sleep, i.e. to check if the write 
> has already happened. We're waiting on *a write* rather than *a value*, 
> so it's not equivalent to "wait until equal" call. It's more of a "sleep 
> until something happens".

Please make things explicit in doxygen.
The behaviour of each case should be explained crystal clear.
Thanks



^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v18 0/2] Add PMD power management
  2021-01-18 12:41                   ` David Hunt
@ 2021-01-19 16:45                     ` Anatoly Burakov
  2021-01-19 16:45                       ` [dpdk-dev] [PATCH v18 1/2] power: add PMD power management API and callback Anatoly Burakov
                                         ` (2 more replies)
  0 siblings, 3 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-19 16:45 UTC (permalink / raw)
  To: dev; +Cc: thomas

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver, which lets us know the address of the wake-up event
and wait for writes to that address.

To achieve power savings, there is a very simple mechanism used: we're 
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
an Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.
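
The counting logic can be sketched roughly as follows. This is only an
illustration, not the callback code from the patch: the names
(queue_state_sketch, update_empty_polls) are hypothetical, and the
threshold of 512 simply mirrors EMPTYPOLL_MAX from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX_SKETCH 512 /* mirrors EMPTYPOLL_MAX in the patch */

/* Hypothetical per-queue state; the patch keeps more fields than this. */
struct queue_state_sketch {
	uint64_t empty_polls;
};

/*
 * Called after every Rx burst with the number of packets received.
 * Returns true when the power saving scheme should be applied.
 */
static bool
update_empty_polls(struct queue_state_sketch *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic is back: reset the counter */
		q->empty_polls = 0;
		return false;
	}
	q->empty_polls++;
	return q->empty_polls > EMPTYPOLL_MAX_SKETCH;
}
```

Any burst that returns packets resets the counter, so power saving only
kicks in after a sustained run of empty polls.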

Why are we putting it into ethdev as opposed to leaving this up to the 
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows the
user to just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle RX on at most a single queue
- Three policy types are supported: Monitor, Pause, and Frequency Scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v18:
- Rebase on top of latest main
- Address review comments by Thomas

v17:
- Added exception for ethdev driver-only ABI
- Added memory barriers for monitor/wakeup (Konstantin)
- Fixed compile issues on non-x86 platforms (hopefully!)

v16:
- Implemented Konstantin's suggestions and comments
- Added return values to the API

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Liang Ma (2):
  power: add PMD power management API and callback
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  41 ++
 doc/guides/rel_notes/release_21_02.rst        |  10 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 examples/l3fwd-power/main.c                   |  90 ++++-
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 365 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  91 +++++
 lib/librte_power/version.map                  |   5 +
 8 files changed, 638 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v18 1/2] power: add PMD power management API and callback
  2021-01-19 16:45                     ` [dpdk-dev] [PATCH v18 0/2] Add PMD power management Anatoly Burakov
@ 2021-01-19 16:45                       ` Anatoly Burakov
  2021-01-19 16:45                       ` [dpdk-dev] [PATCH v18 2/2] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2021-01-20 11:50                       ` [dpdk-dev] [PATCH v19 0/4] Add PMD power management Anatoly Burakov
  2 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-19 16:45 UTC (permalink / raw)
  To: dev; +Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling

   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.
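
For the pause scheme without TPAUSE, the patch estimates how many
rte_pause() calls fit into one microsecond by timing a fixed number of
them. The arithmetic can be sketched as below; this is an illustration
under stated assumptions, not the code in the patch. The function name
and the zero-input guard are hypothetical additions.

```c
#include <stdint.h>

/*
 * Illustrative helper (not part of the patch): given the timer frequency
 * and the number of cycles that n_pauses back-to-back pause calls took,
 * estimate how many pauses fit in one microsecond.
 */
static uint64_t
pauses_per_us_sketch(uint64_t timer_hz, uint64_t elapsed_cycles,
		uint32_t n_pauses)
{
	const uint64_t cycles_per_us = timer_hz / 1000000;
	double us, us_per_pause;

	/* guard added for the sketch: avoid dividing by zero */
	if (cycles_per_us == 0 || elapsed_cycles == 0 || n_pauses == 0)
		return 0;

	us = elapsed_cycles / (double)cycles_per_us;
	us_per_pause = us / n_pauses;
	return (uint64_t)(1.0 / us_per_pause);
}
```

For example, with a 2 GHz timer, 8000 pauses taking 4000000 cycles means
each pause costs 0.25us, so about 4 pauses approximate a 1us sleep.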

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v17:
    - Added memory barriers suggested by Konstantin
    - Removed the BUSY state

 doc/guides/prog_guide/power_man.rst    |  41 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 365 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  91 ++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 515 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..f36ba0027c 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+Ethernet PMD Power Management API
+---------------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+  * Monitor
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not available, use ``rte_pause()``.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing ``librte_power`` library
+   functionality to scale the core frequency up/down depending on traffic
+   volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for Ethernet PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core.
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core.
+
 References
 ----------
 
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index be5ea4370c..1988960b76 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -76,6 +76,16 @@ New Features
 
   * Added inner UDP/IPv4 support for VXLAN IPv4 GSO.
 
+* **Added Ethernet PMD power management helper API.**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_ethdev_pmgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..e5a11cb834 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..454ef7091e
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,365 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/* store some internal state */
+static struct pmd_conf_data {
+	/** what do we support? */
+	struct rte_cpu_intrinsics intrinsics_support;
+	/** pre-calculated tsc diff for 1us */
+	uint64_t tsc_per_us;
+	/** how many rte_pause can we fit in a microsecond? */
+	uint64_t pause_per_us;
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED
+};
+
+struct pmd_queue_cfg {
+	volatile enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	volatile bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us*/
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			uint16_t ret;
+
+			/*
+			 * we might get a cancellation request while being
+			 * inside the callback, in which case the wakeup
+			 * wouldn't work because it would've arrived too early.
+			 *
+			 * to get around this, we notify the other thread that
+			 * we're sleeping, so that it can spin until we're done.
+			 * unsolicited wakeups are perfectly safe.
+			 */
+			q_conf->umwait_in_progress = true;
+
+			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+			/* check if we need to cancel sleep */
+			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+				/* use monitoring condition to sleep */
+				ret = rte_eth_get_monitor_addr(port_id, qidx,
+						&pmc);
+				if (ret == 0)
+					rte_power_monitor(&pmc, -1ULL);
+			}
+			q_conf->umwait_in_progress = false;
+
+			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
+		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	struct rte_eth_dev_info info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_eth_dev_info_get(port_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if queue id is valid */
+	if (queue_id >= info.nb_rx_queues) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		struct rte_power_monitor_cond dummy;
+
+		/* check if rte_power_monitor is supported */
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->umwait_in_progress = false;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* ensure we update our state before callback starts */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* this is not necessary here, but do it anyway */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* this is not necessary here, but do it anyway */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+end:
+	return ret;
+}
+
+int
+rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* stop any callbacks from progressing */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	/* ensure we update our state before continuing */
+	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..7a0ac24625
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Enable power management on a specified Ethernet device Rx queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management scheme to use for specified Rx queue.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Disable power management on a specified Ethernet device Rx queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..f38a380212 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_ethdev_pmgmt_queue_disable;
+	rte_power_ethdev_pmgmt_queue_enable;
+
 };
-- 
2.25.1


* [dpdk-dev] [PATCH v18 2/2] examples/l3fwd-power: enable PMD power mgmt
  2021-01-19 16:45                     ` [dpdk-dev] [PATCH v18 0/2] Add PMD power management Anatoly Burakov
  2021-01-19 16:45                       ` [dpdk-dev] [PATCH v18 1/2] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-19 16:45                       ` Anatoly Burakov
  2021-01-20 11:50                       ` [dpdk-dev] [PATCH v19 0/4] Add PMD power management Anatoly Burakov
  2 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-19 16:45 UTC (permalink / raw)
  To: dev; +Cc: Liang Ma, David Hunt, thomas

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v12:
    - Allow selecting PMD power management scheme from command-line
    - Enforce 1 core 1 queue rule

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 90 ++++++++++++++++++-
 2 files changed, 123 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` performs simple L3 forwarding and enables a power-saving scheme on a specific
+port/queue/lcore. Its main purpose is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use the ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..61fbae6c4f 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_ethdev_pmgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2824,6 +2895,21 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+
+				rte_power_ethdev_pmgmt_queue_disable(lcore_id,
+						portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.25.1


* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-19 14:17                           ` Thomas Monjalon
@ 2021-01-20 10:32                             ` Burakov, Anatoly
  2021-01-20 10:38                               ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-20 10:32 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara

On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
> 19/01/2021 12:23, Burakov, Anatoly:
>> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
>>> 19/01/2021 11:29, Burakov, Anatoly:
>>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
>>>>> 14/01/2021 15:46, Anatoly Burakov:
>>>>>> +struct rte_power_monitor_cond {
>>>>>> +	volatile void *addr;  /**< Address to monitor for changes */
>>>>>> +	uint64_t val;         /**< Before attempting the monitoring, the address
>>>>>> +	                       *   may be read and compared against this value.
>>>>>
>>>>> "may" be read and compared?
>>>>> Is there a case where there is no read and compare?
>>>>
>>>> Yes, if the mask is not set.
>>>
>>> If the mask is not set, the address is "read" anyway
>>> or it is only "watched" for any change?
>>>
>>> Sorry the mechanism is really not clear to me.
>>>
>>
>> The "value" is only used to avoid the sleep, i.e. to check if the write
>> has already happened. We're waiting on *a write* rather than *a value*,
>> so it's not equivalent to "wait until equal" call. It's more of a "sleep
>> until something happens".
> 
> Please make things explicit in doxygen.
> The behaviour of each case should be explained crystal clear.
> Thanks
> 
> 

It is explained in the comments to `rte_power_monitor()` call. But OK, 
i'll add more clarification for the struct too.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-20 10:32                             ` Burakov, Anatoly
@ 2021-01-20 10:38                               ` Thomas Monjalon
  2021-01-20 11:05                                 ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-20 10:38 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara

20/01/2021 11:32, Burakov, Anatoly:
> On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
> > 19/01/2021 12:23, Burakov, Anatoly:
> >> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
> >>> 19/01/2021 11:29, Burakov, Anatoly:
> >>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
> >>>>> 14/01/2021 15:46, Anatoly Burakov:
> >>>>>> +struct rte_power_monitor_cond {
> >>>>>> +	volatile void *addr;  /**< Address to monitor for changes */
> >>>>>> +	uint64_t val;         /**< Before attempting the monitoring, the address
> >>>>>> +	                       *   may be read and compared against this value.
> >>>>>
> >>>>> "may" be read and compared?
> >>>>> Is there a case where there is no read and compare?
> >>>>
> >>>> Yes, if the mask is not set.
> >>>
> >>> If the mask is not set, the address is "read" anyway
> >>> or it is only "watched" for any change?
> >>>
> >>> Sorry the mechanism is really not clear to me.
> >>>
> >>
> >> The "value" is only used to avoid the sleep, i.e. to check if the write
> >> has already happened. We're waiting on *a write* rather than *a value*,
> >> so it's not equivalent to "wait until equal" call. It's more of a "sleep
> >> until something happens".
> > 
> > Please make things explicit in doxygen.
> > The behaviour of each case should be explained crystal clear.
> > Thanks
> 
> It is explained in the comments to `rte_power_monitor()` call. But OK, 
> i'll add more clarification for the struct too.

Please avoid the word "may" in API description.

This is what is explained in rte_power_monitor:
"
 * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
 * mask is non-zero, the current value pointed to by the `p` pointer will be
 * checked against the expected value, and if they match, the entering of
 * optimized power state may be aborted.
"

Can we replace "may" by "will"?




* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-20 10:38                               ` Thomas Monjalon
@ 2021-01-20 11:05                                 ` Burakov, Anatoly
  2021-01-20 11:11                                   ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-20 11:05 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara

On 20-Jan-21 10:38 AM, Thomas Monjalon wrote:
> 20/01/2021 11:32, Burakov, Anatoly:
>> On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
>>> 19/01/2021 12:23, Burakov, Anatoly:
>>>> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
>>>>> 19/01/2021 11:29, Burakov, Anatoly:
>>>>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
>>>>>>> 14/01/2021 15:46, Anatoly Burakov:
>>>>>>>> +struct rte_power_monitor_cond {
>>>>>>>> +	volatile void *addr;  /**< Address to monitor for changes */
>>>>>>>> +	uint64_t val;         /**< Before attempting the monitoring, the address
>>>>>>>> +	                       *   may be read and compared against this value.
>>>>>>>
>>>>>>> "may" be read and compared?
>>>>>>> Is there a case where there is no read and compare?
>>>>>>
>>>>>> Yes, if the mask is not set.
>>>>>
>>>>> If the mask is not set, the address is "read" anyway
>>>>> or it is only "watched" for any change?
>>>>>
>>>>> Sorry the mechanism is really not clear to me.
>>>>>
>>>>
>>>> The "value" is only used to avoid the sleep, i.e. to check if the write
>>>> has already happened. We're waiting on *a write* rather than *a value*,
>>>> so it's not equivalent to "wait until equal" call. It's more of a "sleep
>>>> until something happens".
>>>
>>> Please make things explicit in doxygen.
>>> The behaviour of each case should be explained crystal clear.
>>> Thanks
>>
>> It is explained in the comments to `rte_power_monitor()` call. But OK,
>> i'll add more clarification for the struct too.
> 
> Please avoid the word "may" in API description.
> 
> This is what is explained in rte_power_monitor:
> "
>   * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
>   * mask is non-zero, the current value pointed to by the `p` pointer will be
>   * checked against the expected value, and if they match, the entering of
>   * optimized power state may be aborted.
> "
> 
> Can we replace "may" by "will"?
> 

Yep, we can. However, the "may" part was intended to leave some wiggle 
room for a different implementation, should the need arise, and i find 
"will" to be needlessly prescriptive. Frankly, i do not see the need for 
such a detailed description of what the API does under the hood, as long 
as it's clear what its effects are. The main purpose is waiting for a 
write. The mask is only used to check whether the expected write has 
already happened by the time we're calling the API. Whether the CPU then 
does or does not go to sleep is not really relevant IMO.
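
To make the intent concrete, here is a minimal sketch of the check described
above. It is an illustration only, not the EAL implementation: the struct
and function names are hypothetical, and the monitored location is assumed
to be a 64-bit word for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the monitor condition (hypothetical names). */
struct monitor_cond {
	volatile uint64_t *addr; /* address to watch for writes */
	uint64_t val;            /* value expected once the write has landed */
	uint64_t mask;           /* bits to compare; 0 disables the check */
};

/*
 * Returns true if the sleep should be skipped because the expected write
 * appears to have happened already. With a zero mask no comparison is
 * made and the caller simply waits for a write.
 */
static bool
write_already_happened(const struct monitor_cond *c)
{
	if (c->mask == 0)
		return false;
	return (*c->addr & c->mask) == c->val;
}
```

The comparison only decides whether to skip the sleep; waking up is still
driven by a write to the address (or a timeout), not by the value itself.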

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-20 11:05                                 ` Burakov, Anatoly
@ 2021-01-20 11:11                                   ` Thomas Monjalon
  2021-01-20 11:17                                     ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-20 11:11 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara, david.marchand, jerinj,
	ajit.khaparde, honnappa.nagarahalli, David Christensen

20/01/2021 12:05, Burakov, Anatoly:
> On 20-Jan-21 10:38 AM, Thomas Monjalon wrote:
> > 20/01/2021 11:32, Burakov, Anatoly:
> >> On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
> >>> 19/01/2021 12:23, Burakov, Anatoly:
> >>>> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
> >>>>> 19/01/2021 11:29, Burakov, Anatoly:
> >>>>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
> >>>>>>> 14/01/2021 15:46, Anatoly Burakov:
> >>>>>>>> +struct rte_power_monitor_cond {
> >>>>>>>> +	volatile void *addr;  /**< Address to monitor for changes */
> >>>>>>>> +	uint64_t val;         /**< Before attempting the monitoring, the address
> >>>>>>>> +	                       *   may be read and compared against this value.
> >>>>>>>
> >>>>>>> "may" be read and compared?
> >>>>>>> Is there a case where there is no read and compare?
> >>>>>>
> >>>>>> Yes, if the mask is not set.
> >>>>>
> >>>>> If the mask is not set, the address is "read" anyway
> >>>>> or it is only "watched" for any change?
> >>>>>
> >>>>> Sorry the mechanism is really not clear to me.
> >>>>>
> >>>>
> >>>> The "value" is only used to avoid the sleep, i.e. to check if the write
> >>>> has already happened. We're waiting on *a write* rather than *a value*,
> >>>> so it's not equivalent to "wait until equal" call. It's more of a "sleep
> >>>> until something happens".
> >>>
> >>> Please make things explicit in doxygen.
> >>> The behaviour of each case should be explained crystal clear.
> >>> Thanks
> >>
> >> It is explained in the comments to `rte_power_monitor()` call. But OK,
> >> i'll add more clarification for the struct too.
> > 
> > Please avoid the word "may" in API description.
> > 
> > This is what is explained in rte_power_monitor:
> > "
> >   * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> >   * mask is non-zero, the current value pointed to by the `p` pointer will be
> >   * checked against the expected value, and if they match, the entering of
> >   * optimized power state may be aborted.
> > "
> > 
> > Can we replace "may" by "will"?
> > 
> 
> Yep, we can. However, the "may" part was intended to leave some wiggle 
> room for a different implementation, should the need arise, and i find 
> "will" to be needlessly prescriptive. Frankly, i do not see the need for 
> such a detailed description of what the API does under the hood, as long 
> as it's clear what its effects are. The main purpose is waiting for a 
> write. The mask is only used to check whether the expected write has 
> already happened by the time we're calling the API. Whether the CPU then 
> does or does not go to sleep is not really relevant IMO.

I think it is relevant but I may be wrong.
Any other opinions?




* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-20 11:11                                   ` Thomas Monjalon
@ 2021-01-20 11:17                                     ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-20 11:17 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	david.hunt, chris.macnamara, david.marchand, ajit.khaparde,
	honnappa.nagarahalli

On 20-Jan-21 11:11 AM, Thomas Monjalon wrote:
> 20/01/2021 12:05, Burakov, Anatoly:
>> On 20-Jan-21 10:38 AM, Thomas Monjalon wrote:
>>> 20/01/2021 11:32, Burakov, Anatoly:
>>>> On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
>>>>> 19/01/2021 12:23, Burakov, Anatoly:
>>>>>> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
>>>>>>> 19/01/2021 11:29, Burakov, Anatoly:
>>>>>>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
>>>>>>>>> 14/01/2021 15:46, Anatoly Burakov:
>>>>>>>>>> +struct rte_power_monitor_cond {
>>>>>>>>>> +	volatile void *addr;  /**< Address to monitor for changes */
>>>>>>>>>> +	uint64_t val;         /**< Before attempting the monitoring, the address
>>>>>>>>>> +	                       *   may be read and compared against this value.
>>>>>>>>>
>>>>>>>>> "may" be read and compared?
>>>>>>>>> Is there a case where there is no read and compare?
>>>>>>>>
>>>>>>>> Yes, if the mask is not set.
>>>>>>>
>>>>>>> If the mask is not set, the address is "read" anyway
>>>>>>> or it is only "watched" for any change?
>>>>>>>
>>>>>>> Sorry the mechanism is really not clear to me.
>>>>>>>
>>>>>>
>>>>>> The "value" is only used to avoid the sleep, i.e. to check if the write
>>>>>> has already happened. We're waiting on *a write* rather than *a value*,
>>>>>> so it's not equivalent to "wait until equal" call. It's more of a "sleep
>>>>>> until something happens".
>>>>>
>>>>> Please make things explicit in doxygen.
>>>>> The behaviour of each case should be explained crystal clear.
>>>>> Thanks
>>>>
>>>> It is explained in the comments to `rte_power_monitor()` call. But OK,
>>>> i'll add more clarification for the struct too.
>>>
>>> Please avoid the word "may" in API description.
>>>
>>> This is what is explained in rte_power_monitor:
>>> "
>>>    * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
>>>    * mask is non-zero, the current value pointed to by the `p` pointer will be
>>>    * checked against the expected value, and if they match, the entering of
>>>    * optimized power state may be aborted.
>>> "
>>>
>>> Can we replace "may" by "will"?
>>>
>>
>> Yep, we can. However, the "may" part was intended to leave some wiggle
>> room for a different implementation, should the need arise, and i find
>> "will" to be needlessly prescriptive. Frankly, i do not see the need for
>> such a detailed description of what the API does under the hood, as long
>> as it's clear what its effects are. The main purpose is waiting for a
>> write. The mask is only used to check whether the expected write has
>> already happened by the time we're calling the API. Whether the CPU then
>> does or does not go to sleep is not really relevant IMO.
> 
> I think it is relevant but I may be wrong.
> Any other opinions?
> 

I have no objection in documenting that further, so i'll go ahead and do 
it :) It's just that i think it's not necessary to be *this* detailed 
about how the API does things.

-- 
Thanks,
Anatoly


* [dpdk-dev] [PATCH v19 0/4] Add PMD power management
  2021-01-19 16:45                     ` [dpdk-dev] [PATCH v18 0/2] Add PMD power management Anatoly Burakov
  2021-01-19 16:45                       ` [dpdk-dev] [PATCH v18 1/2] power: add PMD power management API and callback Anatoly Burakov
  2021-01-19 16:45                       ` [dpdk-dev] [PATCH v18 2/2] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-20 11:50                       ` Anatoly Burakov
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 1/4] eal: rename power monitor condition member Anatoly Burakov
                                           ` (4 more replies)
  2 siblings, 5 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-20 11:50 UTC (permalink / raw)
  To: dev; +Cc: thomas

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver, which tells us the address of the wake-up event so
that we can wait for writes on that address.

To achieve power savings, there is a very simple mechanism used: we're 
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
a Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.
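
The counting logic can be sketched roughly as follows. This is a
simplified illustration, not the code from the patch: the threshold value
and all names here are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the value used by the library may differ. */
#define EMPTY_POLL_THRESHOLD 512

struct queue_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Meant to be called after every Rx burst with the number of packets
 * received. Returns true when the caller should apply its power-saving
 * scheme (monitor, pause or frequency scaling).
 */
static bool
should_save_power(struct queue_state *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic is back, so start counting from scratch */
		q->empty_polls = 0;
		return false;
	}
	q->empty_polls++;
	return q->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

Because the counter is reset on any non-empty poll, a single packet is
enough to bring the queue back to normal polling.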

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach lets the
user just flip a switch and automatically get power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle Rx on at most one queue
- Three policies are supported: monitor, pause, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types
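
As an illustration of the 1:1 restriction, an application could validate
its (port, queue, lcore) mapping before enabling power management on any
queue. The sketch below is hypothetical: the struct, the function name and
the MAX_LCORES bound are made up for the example and are not part of the
proposed API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LCORES 128 /* hypothetical bound for this sketch */

struct rx_mapping {
	uint16_t port_id;
	uint16_t queue_id;
	unsigned int lcore_id;
};

/* Returns true if no lcore is assigned more than one Rx queue. */
static bool
mapping_is_one_to_one(const struct rx_mapping *map, size_t n)
{
	uint8_t queues_per_lcore[MAX_LCORES] = { 0 };

	for (size_t i = 0; i < n; i++) {
		if (map[i].lcore_id >= MAX_LCORES)
			return false;
		if (++queues_per_lcore[map[i].lcore_id] > 1)
			return false;
	}
	return true;
}
```

For example, a mapping such as (0,0,2),(0,1,3) passes this check, while
(0,0,2),(0,1,2) does not, because lcore 2 would poll two queues.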

v19:
- Renamed "data_sz" to "size" and clarified struct comments
- Clarified documentation around rte_power_monitor/pause API

v18:
- Rebase on top of latest main
- Address review comments by Thomas

v17:
- Added exception for ethdev driver-only ABI
- Added memory barriers for monitor/wakeup (Konstantin)
- Fixed compile issues on non-x86 platforms (hopefully!)

v16:
- Implemented Konstantin's suggestions and comments
- Added return values to the API

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (2):
  eal: rename power monitor condition member
  eal: improve comments around power monitoring API

Liang Ma (2):
  power: add PMD power management API and callback
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  41 ++
 doc/guides/rel_notes/release_21_02.rst        |  10 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |   2 +-
 drivers/event/dlb2/dlb2.c                     |   2 +-
 drivers/net/i40e/i40e_rxtx.c                  |   2 +-
 drivers/net/ice/ice_rxtx.c                    |   2 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |   2 +-
 examples/l3fwd-power/main.c                   |  90 ++++-
 .../include/generic/rte_power_intrinsics.h    |  39 +-
 lib/librte_eal/x86/rte_power_intrinsics.c     |   4 +-
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 365 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  91 +++++
 lib/librte_power/version.map                  |   5 +
 15 files changed, 669 insertions(+), 26 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1


* [dpdk-dev] [PATCH v19 1/4] eal: rename power monitor condition member
  2021-01-20 11:50                       ` [dpdk-dev] [PATCH v19 0/4] Add PMD power management Anatoly Burakov
@ 2021-01-20 11:50                         ` Anatoly Burakov
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 2/4] eal: improve comments around power monitoring API Anatoly Burakov
                                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-20 11:50 UTC (permalink / raw)
  To: dev
  Cc: Timothy McDaniel, Beilei Xing, Jeff Guo, Qiming Yang, Qi Zhang,
	Haiyue Wang, Bruce Richardson, Konstantin Ananyev, thomas

The `data_sz` name is fine, but it looks out of place because nothing
else has a "data" prefix in that structure. Rename it to "size", and
add more clarity to the comments around each struct member.

Fixes: 6a17919b0e2a ("eal: change power intrinsics API")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/event/dlb/dlb.c                       |  2 +-
 drivers/event/dlb2/dlb2.c                     |  2 +-
 drivers/net/i40e/i40e_rxtx.c                  |  2 +-
 drivers/net/ice/ice_rxtx.c                    |  2 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |  2 +-
 .../include/generic/rte_power_intrinsics.h    | 19 +++++++++++--------
 lib/librte_eal/x86/rte_power_intrinsics.c     |  4 ++--
 7 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c
index d2f2026291..a65f70882f 100644
--- a/drivers/event/dlb/dlb.c
+++ b/drivers/event/dlb/dlb.c
@@ -3185,7 +3185,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		pmc.addr = monitor_addr;
 		pmc.val = expected_value;
 		pmc.mask = qe_mask.raw_qe[1];
-		pmc.data_sz = sizeof(uint64_t);
+		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
 
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index c9a8a02278..5782960158 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -2894,7 +2894,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		pmc.addr = monitor_addr;
 		pmc.val = expected_value;
 		pmc.mask = qe_mask.raw_qe[1];
-		pmc.data_sz = sizeof(uint64_t);
+		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
 
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 0b4220fc9c..d8e9db55d8 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -92,7 +92,7 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
 
 	/* registers are 64-bit */
-	pmc->data_sz = sizeof(uint64_t);
+	pmc->size = sizeof(uint64_t);
 
 	return 0;
 }
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 066651dc48..5909e3707b 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -46,7 +46,7 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
 
 	/* register is 16-bit */
-	pmc->data_sz = sizeof(uint16_t);
+	pmc->size = sizeof(uint16_t);
 
 	return 0;
 }
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index cc8f70e6dd..c0305a8238 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1389,7 +1389,7 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
 
 	/* the registers are 32-bit */
-	pmc->data_sz = sizeof(uint32_t);
+	pmc->size = sizeof(uint32_t);
 
 	return 0;
 }
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 6109d28faa..5960c48c80 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -20,14 +20,17 @@
 
 struct rte_power_monitor_cond {
 	volatile void *addr;  /**< Address to monitor for changes */
-	uint64_t val;         /**< Before attempting the monitoring, the address
-	                       *   may be read and compared against this value.
-	                       **/
-	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
-	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
-	                  *   expected value with the memory address. Can be 1,
-	                  *   2, 4, or 8. Supplying any other value will lead to
-	                  *   undefined result. */
+	uint64_t val;         /**< If the `mask` is non-zero, location pointed
+	                       *   to by `addr` will be read and compared
+	                       *   against this value.
+	                       */
+	uint64_t mask;   /**< 64-bit mask to extract value read from `addr` */
+	uint8_t size;    /**< Data size (in bytes) that will be used to compare
+	                  *   expected value (`val`) with data read from the
+	                  *   monitored memory location (`addr`). Can be 1, 2,
+	                  *   4, or 8. Supplying any other value will result in
+	                  *   an error.
+	                  */
 };
 
 /**
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index af3ae3237c..39ea9fdecd 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -88,7 +88,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc == NULL)
 		return -EINVAL;
 
-	if (__check_val_size(pmc->data_sz) < 0)
+	if (__check_val_size(pmc->size) < 0)
 		return -EINVAL;
 
 	s = &wait_status[lcore_id];
@@ -113,7 +113,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	/* if we have a comparison mask, we might not need to sleep at all */
 	if (pmc->mask) {
 		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->data_sz);
+				pmc->addr, pmc->size);
 		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-- 
2.25.1


* [dpdk-dev] [PATCH v19 2/4] eal: improve comments around power monitoring API
  2021-01-20 11:50                       ` [dpdk-dev] [PATCH v19 0/4] Add PMD power management Anatoly Burakov
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 1/4] eal: rename power monitor condition member Anatoly Burakov
@ 2021-01-20 11:50                         ` Anatoly Burakov
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 3/4] power: add PMD power management API and callback Anatoly Burakov
                                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-20 11:50 UTC (permalink / raw)
  To: dev; +Cc: thomas

Currently, the API documentation is ambiguous as to what happens when
certain conditions are met. Document the behavior explicitly, as well as
fix some typos and outdated comments.
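
As an illustration of the behavior being documented, the check performed
before entering the optimized power state can be sketched as below. This is
a simplified stand-in, not the EAL implementation: `monitor_cond` and
`monitor_should_skip_sleep` are hypothetical names, and the exact comparison
(masked value equal to `val`) is an assumption based on the comments in
this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_power_monitor_cond. */
struct monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* expected value */
	uint64_t mask;       /* mask applied to the value read from addr */
	uint8_t size;        /* 1, 2, 4, or 8 bytes */
};

/*
 * Returns 1 if sleeping should be skipped because the masked value already
 * matches, 0 if it is fine to sleep, and -1 if size is not 1, 2, 4, or 8.
 */
static int
monitor_should_skip_sleep(const struct monitor_cond *c)
{
	uint64_t cur;

	assert(c != NULL && c->addr != NULL);

	/* read exactly `size` bytes from the monitored location */
	switch (c->size) {
	case 1:
		cur = *(const volatile uint8_t *)c->addr;
		break;
	case 2:
		cur = *(const volatile uint16_t *)c->addr;
		break;
	case 4:
		cur = *(const volatile uint32_t *)c->addr;
		break;
	case 8:
		cur = *(const volatile uint64_t *)c->addr;
		break;
	default:
		return -1;
	}

	/* a zero mask means no comparison is requested */
	if (c->mask == 0)
		return 0;

	/* the write we would wait for may already have happened */
	return (cur & c->mask) == c->val ? 1 : 0;
}
```

For example, a driver that waits on a "descriptor done" bit would set `mask`
and `val` to that bit, so a descriptor completed just before the call does
not put the core to sleep.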

Fixes: 6a17919b0e2a ("eal: change power intrinsics API")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    | 20 ++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 5960c48c80..dddca3d41c 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -35,17 +35,20 @@ struct rte_power_monitor_cond {
 
 /**
  * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
+ * @b EXPERIMENTAL: this API may change without prior notice.
  *
  * Monitor specific address for changes. This will cause the CPU to enter an
  * architecture-defined optimized power state until either the specified
  * memory address is written to, a certain TSC timestamp is reached, or other
  * reasons cause the CPU to wake up.
  *
- * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
- * mask is non-zero, the current value pointed to by the `p` pointer will be
- * checked against the expected value, and if they match, the entering of
- * optimized power state may be aborted.
+ * Additionally, an expected value (`pmc->val`), mask (`pmc->mask`), and data
+ * size (`pmc->size`) are provided in the `pmc` power monitoring condition. If
+ * the mask is non-zero, the current value pointed to by the `pmc->addr` pointer
+ * will be read and compared against the expected value, and if they match, the
+ * entering of optimized power state will be aborted. This is intended to
+ * prevent the CPU from entering optimized power state and waiting on a write
+ * that has already happened by the time this API is called.
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
@@ -67,11 +70,14 @@ int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 
 /**
  * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
+ * @b EXPERIMENTAL: this API may change without prior notice.
  *
  * Wake up a specific lcore that is in a power optimized state and is monitoring
  * an address.
  *
+ * @note It is safe to call this function if the lcore in question is not
+ *   sleeping. The function will have no effect.
+ *
  * @note This function will *not* wake up a core that is in a power optimized
  *   state due to calling `rte_power_pause`.
  *
@@ -83,7 +89,7 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
 
 /**
  * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
+ * @b EXPERIMENTAL: this API may change without prior notice.
  *
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
-- 
2.25.1


* [dpdk-dev] [PATCH v19 3/4] power: add PMD power management API and callback
  2021-01-20 11:50                       ` [dpdk-dev] [PATCH v19 0/4] Add PMD power management Anatoly Burakov
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 1/4] eal: rename power monitor condition member Anatoly Burakov
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 2/4] eal: improve comments around power monitoring API Anatoly Burakov
@ 2021-01-20 11:50                         ` Anatoly Burakov
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 4/4] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2021-01-22 17:12                         ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Anatoly Burakov
  4 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-20 11:50 UTC (permalink / raw)
  To: dev; +Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait until
either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling

   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.
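
For the Pause scheme, when TPAUSE is not available the callback sleeps for
roughly one microsecond by issuing a pre-computed number of rte_pause()
calls. The arithmetic behind that pre-computation can be sketched on its
own as below. The function name `pauses_per_us` is hypothetical, and the
TSC readings are passed in as parameters instead of being measured, so this
shows only the calculation and not the actual calibration loop.

```c
#include <assert.h>
#include <stdint.h>

#define US_PER_S 1000000ULL

/*
 * Estimate how many pause instructions fit in one microsecond, given the
 * timer frequency and the TSC readings taken before and after executing
 * n_pauses pause instructions.
 */
static uint64_t
pauses_per_us(uint64_t timer_hz, uint64_t tsc_start, uint64_t tsc_end,
		uint32_t n_pauses)
{
	const uint64_t tsc_per_us = timer_hz / US_PER_S;
	double us, us_per_pause;

	assert(tsc_per_us > 0 && n_pauses > 0 && tsc_end > tsc_start);

	/* elapsed time of the whole measurement, in microseconds */
	us = (tsc_end - tsc_start) / (double)tsc_per_us;
	/* average cost of a single pause, in microseconds */
	us_per_pause = us / n_pauses;

	/* truncates toward zero, so the estimate never overshoots */
	return (uint64_t)(1.0 / us_per_pause);
}
```

As a worked case under these assumptions: with a 2 GHz timer, 10000 pauses
that take 5000000 cycles last 2500 us in total, or 0.25 us each, which
gives 4 pauses per microsecond.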

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v17:
    - Added memory barriers suggested by Konstantin
    - Removed the BUSY state

 doc/guides/prog_guide/power_man.rst    |  41 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 365 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  91 ++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 515 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..f36ba0027c 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+Ethernet PMD Power Management API
+---------------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+  * Monitor
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not available, use ``rte_pause()``.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing ``librte_power`` library
+   functionality to scale the core frequency up/down depending on traffic
+   volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for Ethernet PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core.
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core.
+
 References
 ----------
 
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index be5ea4370c..1988960b76 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -76,6 +76,16 @@ New Features
 
   * Added inner UDP/IPv4 support for VXLAN IPv4 GSO.
 
+* **Added Ethernet PMD power management helper API.**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_ethdev_pmgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..e5a11cb834 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..454ef7091e
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,365 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/* store some internal state */
+static struct pmd_conf_data {
+	/** what do we support? */
+	struct rte_cpu_intrinsics intrinsics_support;
+	/** pre-calculated tsc diff for 1us */
+	uint64_t tsc_per_us;
+	/** how many rte_pause can we fit in a microsecond? */
+	uint64_t pause_per_us;
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED
+};
+
+struct pmd_queue_cfg {
+	volatile enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	volatile bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us*/
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			uint16_t ret;
+
+			/*
+			 * we might get a cancellation request while being
+			 * inside the callback, in which case the wakeup
+			 * wouldn't work because it would've arrived too early.
+			 *
+			 * to get around this, we notify the other thread that
+			 * we're sleeping, so that it can spin until we're done.
+			 * unsolicited wakeups are perfectly safe.
+			 */
+			q_conf->umwait_in_progress = true;
+
+			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+			/* check if we need to cancel sleep */
+			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+				/* use monitoring condition to sleep */
+				ret = rte_eth_get_monitor_addr(port_id, qidx,
+						&pmc);
+				if (ret == 0)
+					rte_power_monitor(&pmc, -1ULL);
+			}
+			q_conf->umwait_in_progress = false;
+
+			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
+		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	struct rte_eth_dev_info info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_eth_dev_info_get(port_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if queue id is valid */
+	if (queue_id >= info.nb_rx_queues) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		struct rte_power_monitor_cond dummy;
+
+		/* check if rte_power_monitor is supported */
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->umwait_in_progress = false;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* ensure we update our state before callback starts */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* this is not necessary here, but do it anyway */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* this is not necessary here, but do it anyway */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+end:
+	return ret;
+}
+
+int
+rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* stop any callbacks from progressing */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	/* ensure we update our state before continuing */
+	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..7a0ac24625
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Enable power management on a specified Ethernet device Rx queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management scheme to use for specified Rx queue.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Disable power management on a specified Ethernet device Rx queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..f38a380212 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_ethdev_pmgmt_queue_disable;
+	rte_power_ethdev_pmgmt_queue_enable;
+
 };
-- 
2.25.1


* [dpdk-dev] [PATCH v19 4/4] examples/l3fwd-power: enable PMD power mgmt
  2021-01-20 11:50                       ` [dpdk-dev] [PATCH v19 0/4] Add PMD power management Anatoly Burakov
                                           ` (2 preceding siblings ...)
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 3/4] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-20 11:50                         ` Anatoly Burakov
  2021-01-22 17:12                         ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Anatoly Burakov
  4 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-20 11:50 UTC (permalink / raw)
  To: dev; +Cc: Liang Ma, David Hunt, thomas

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v12:
    - Allow selecting PMD power management scheme from command-line
    - Enforce 1 core 1 queue rule

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 90 ++++++++++++++++++-
 2 files changed, 123 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode support for ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables the power saving scheme on a specific
+port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 --  --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..61fbae6c4f 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_ethdev_pmgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2824,6 +2895,21 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+
+				rte_power_ethdev_pmgmt_queue_disable(lcore_id,
+						portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v20 0/4] Add PMD power management
  2021-01-20 11:50                       ` [dpdk-dev] [PATCH v19 0/4] Add PMD power management Anatoly Burakov
                                           ` (3 preceding siblings ...)
  2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 4/4] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-22 17:12                         ` Anatoly Burakov
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 1/4] eal: rename power monitor condition member Anatoly Burakov
                                             ` (4 more replies)
  4 siblings, 5 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-22 17:12 UTC (permalink / raw)
  To: dev; +Cc: thomas

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver that will allow us to know the address of the wake-up
event, and wait for writes on that address.

To achieve power savings, there is a very simple mechanism used: we're
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
an Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.
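
The empty poll counting described above can be sketched in plain C as
follows. This is only an illustration with no DPDK dependency: the struct
and function names are hypothetical, and the threshold simply mirrors the
EMPTYPOLL_MAX value (512) used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold; mirrors EMPTYPOLL_MAX (512) from the patch. */
#define EMPTY_POLL_THRESHOLD 512

/* Hypothetical per-queue state, loosely modelled on the patch's queue config. */
struct queue_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Update the empty poll counter after one Rx burst.
 * Returns true when the caller should apply its power saving scheme.
 */
static bool
should_save_power(struct queue_state *q, uint16_t nb_rx)
{
	assert(q != NULL);

	if (nb_rx != 0) {
		/* traffic is back: reset the counter and keep polling */
		q->empty_polls = 0;
		return false;
	}

	q->empty_polls++;
	return q->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

In the patchset itself this decision is made inside the Rx callbacks, so
the application's polling loop does not need to change.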

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows the
user to just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  may handle RX on at most one queue
- Three policies are supported: monitor, pause, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types
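
As an illustration of the 1:1 core to queue restriction noted above, an
application could validate its queue assignments before enabling power
management. The sketch below is plain C with hypothetical names
(rxq_assignment, mapping_is_one_to_one, MAX_LCORES); it is not part of
the patchset, which performs a similar check in l3fwd-power.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LCORES 128 /* hypothetical limit for this sketch */

/* Hypothetical description of one Rx queue and the lcore that polls it. */
struct rxq_assignment {
	uint16_t port_id;
	uint16_t queue_id;
	unsigned int lcore_id;
};

/* Returns true if no lcore polls more than one Rx queue. */
static bool
mapping_is_one_to_one(const struct rxq_assignment *rxqs, size_t n)
{
	unsigned int queues_per_lcore[MAX_LCORES] = { 0 };
	size_t i;

	assert(rxqs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (rxqs[i].lcore_id >= MAX_LCORES)
			return false; /* out of range for this sketch */
		if (++queues_per_lcore[rxqs[i].lcore_id] > 1)
			return false; /* second queue on the same lcore */
	}

	return true;
}
```

With the --config="(0,0,2),(0,1,3)" example from the l3fwd-power
documentation, each of lcores 2 and 3 polls exactly one queue, so the
check passes.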

v20:
- Moved callback removal before port close

v19:
- Renamed "data_sz" to "size" and clarified struct comments
- Clarified documentation around rte_power_monitor/pause API

v18:
- Rebase on top of latest main
- Address review comments by Thomas

v17:
- Added exception for ethdev driver-only ABI
- Added memory barriers for monitor/wakeup (Konstantin)
- Fixed compile issues on non-x86 platforms (hopefully!)

v16:
- Implemented Konstantin's suggestions and comments
- Added return values to the API

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (2):
  eal: rename power monitor condition member
  eal: improve comments around power monitoring API

Liang Ma (2):
  power: add PMD power management API and callback
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  41 ++
 doc/guides/rel_notes/release_21_02.rst        |  10 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |   2 +-
 drivers/event/dlb2/dlb2.c                     |   2 +-
 drivers/net/i40e/i40e_rxtx.c                  |   2 +-
 drivers/net/ice/ice_rxtx.c                    |   2 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |   2 +-
 examples/l3fwd-power/main.c                   |  90 ++++-
 .../include/generic/rte_power_intrinsics.h    |  39 +-
 lib/librte_eal/x86/rte_power_intrinsics.c     |   4 +-
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 365 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  91 +++++
 lib/librte_power/version.map                  |   5 +
 15 files changed, 669 insertions(+), 26 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v20 1/4] eal: rename power monitor condition member
  2021-01-22 17:12                         ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Anatoly Burakov
@ 2021-01-22 17:12                           ` Anatoly Burakov
  2021-01-29 11:26                             ` Thomas Monjalon
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 2/4] eal: improve comments around power monitoring API Anatoly Burakov
                                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-22 17:12 UTC (permalink / raw)
  To: dev
  Cc: Timothy McDaniel, Beilei Xing, Jeff Guo, Qiming Yang, Qi Zhang,
	Haiyue Wang, Bruce Richardson, Konstantin Ananyev, thomas

The `data_sz` name is fine, but it looks out of place because nothing
else has a "data" prefix in that structure. Rename it to "size", and
add more clarity to the comments around each struct member.

Fixes: 6a17919b0e2a ("eal: change power intrinsics API")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/event/dlb/dlb.c                       |  2 +-
 drivers/event/dlb2/dlb2.c                     |  2 +-
 drivers/net/i40e/i40e_rxtx.c                  |  2 +-
 drivers/net/ice/ice_rxtx.c                    |  2 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |  2 +-
 .../include/generic/rte_power_intrinsics.h    | 19 +++++++++++--------
 lib/librte_eal/x86/rte_power_intrinsics.c     |  4 ++--
 7 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c
index d2f2026291..a65f70882f 100644
--- a/drivers/event/dlb/dlb.c
+++ b/drivers/event/dlb/dlb.c
@@ -3185,7 +3185,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb,
 		pmc.addr = monitor_addr;
 		pmc.val = expected_value;
 		pmc.mask = qe_mask.raw_qe[1];
-		pmc.data_sz = sizeof(uint64_t);
+		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
 
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index c9a8a02278..5782960158 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -2894,7 +2894,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		pmc.addr = monitor_addr;
 		pmc.val = expected_value;
 		pmc.mask = qe_mask.raw_qe[1];
-		pmc.data_sz = sizeof(uint64_t);
+		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
 
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 89560d4ee5..668edd6626 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -92,7 +92,7 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
 
 	/* registers are 64-bit */
-	pmc->data_sz = sizeof(uint64_t);
+	pmc->size = sizeof(uint64_t);
 
 	return 0;
 }
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 7286e3a445..69f994579a 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -46,7 +46,7 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
 
 	/* register is 16-bit */
-	pmc->data_sz = sizeof(uint16_t);
+	pmc->size = sizeof(uint16_t);
 
 	return 0;
 }
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index cc8f70e6dd..c0305a8238 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1389,7 +1389,7 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
 
 	/* the registers are 32-bit */
-	pmc->data_sz = sizeof(uint32_t);
+	pmc->size = sizeof(uint32_t);
 
 	return 0;
 }
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 6109d28faa..5960c48c80 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -20,14 +20,17 @@
 
 struct rte_power_monitor_cond {
 	volatile void *addr;  /**< Address to monitor for changes */
-	uint64_t val;         /**< Before attempting the monitoring, the address
-	                       *   may be read and compared against this value.
-	                       **/
-	uint64_t mask;   /**< 64-bit mask to extract current value from addr */
-	uint8_t data_sz; /**< Data size (in bytes) that will be used to compare
-	                  *   expected value with the memory address. Can be 1,
-	                  *   2, 4, or 8. Supplying any other value will lead to
-	                  *   undefined result. */
+	uint64_t val;         /**< If the `mask` is non-zero, location pointed
+	                       *   to by `addr` will be read and compared
+	                       *   against this value.
+	                       */
+	uint64_t mask;   /**< 64-bit mask to extract value read from `addr` */
+	uint8_t size;    /**< Data size (in bytes) that will be used to compare
+	                  *   expected value (`val`) with data read from the
+	                  *   monitored memory location (`addr`). Can be 1, 2,
+	                  *   4, or 8. Supplying any other value will result in
+	                  *   an error.
+	                  */
 };
 
 /**
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index af3ae3237c..39ea9fdecd 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -88,7 +88,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc == NULL)
 		return -EINVAL;
 
-	if (__check_val_size(pmc->data_sz) < 0)
+	if (__check_val_size(pmc->size) < 0)
 		return -EINVAL;
 
 	s = &wait_status[lcore_id];
@@ -113,7 +113,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	/* if we have a comparison mask, we might not need to sleep at all */
 	if (pmc->mask) {
 		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->data_sz);
+				pmc->addr, pmc->size);
 		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v20 2/4] eal: improve comments around power monitoring API
  2021-01-22 17:12                         ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Anatoly Burakov
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 1/4] eal: rename power monitor condition member Anatoly Burakov
@ 2021-01-22 17:12                           ` Anatoly Burakov
  2021-01-29 11:27                             ` Thomas Monjalon
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 3/4] power: add PMD power management API and callback Anatoly Burakov
                                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-22 17:12 UTC (permalink / raw)
  To: dev; +Cc: thomas

Currently, the API documentation is ambiguous as to what happens when
certain conditions are met. Document the behavior explicitly, as well as
fix some typos and outdated comments.
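
For illustration, the pre-sleep check that the updated comments describe
can be sketched in plain C as below. The struct is a local stand-in that
mirrors the fields of rte_power_monitor_cond, and the helper names are
hypothetical; this is not the EAL implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in mirroring the fields of rte_power_monitor_cond. */
struct monitor_cond {
	const volatile void *addr; /* address to monitor */
	uint64_t val;              /* expected value */
	uint64_t mask;             /* mask applied to the value read */
	uint8_t size;              /* 1, 2, 4 or 8 bytes */
};

/* Read `size` bytes from `addr`; returns -EINVAL for unsupported sizes. */
static int
read_sized(const volatile void *addr, uint8_t size, uint64_t *out)
{
	switch (size) {
	case 1: *out = *(const volatile uint8_t *)addr; return 0;
	case 2: *out = *(const volatile uint16_t *)addr; return 0;
	case 4: *out = *(const volatile uint32_t *)addr; return 0;
	case 8: *out = *(const volatile uint64_t *)addr; return 0;
	default: return -EINVAL;
	}
}

/*
 * Returns 1 if entering the sleep state should be aborted because the
 * expected write has already happened, 0 if it is fine to sleep, and
 * -EINVAL on invalid input.
 */
static int
sleep_should_be_skipped(const struct monitor_cond *c)
{
	uint64_t cur;

	if (c == NULL || c->addr == NULL)
		return -EINVAL;
	if (read_sized(c->addr, c->size, &cur) < 0)
		return -EINVAL;
	if (c->mask == 0)
		return 0; /* no comparison requested */

	return (cur & c->mask) == c->val;
}
```

The size is validated even when the mask is zero, which matches the
order of the checks in the x86 implementation touched by patch 1/4.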

Fixes: 6a17919b0e2a ("eal: change power intrinsics API")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    | 20 ++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 5960c48c80..dddca3d41c 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -35,17 +35,20 @@ struct rte_power_monitor_cond {
 
 /**
  * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
+ * @b EXPERIMENTAL: this API may change without prior notice.
  *
  * Monitor specific address for changes. This will cause the CPU to enter an
  * architecture-defined optimized power state until either the specified
  * memory address is written to, a certain TSC timestamp is reached, or other
  * reasons cause the CPU to wake up.
  *
- * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
- * mask is non-zero, the current value pointed to by the `p` pointer will be
- * checked against the expected value, and if they match, the entering of
- * optimized power state may be aborted.
+ * Additionally, an expected value (`pmc->val`), mask (`pmc->mask`), and data
+ * size (`pmc->size`) are provided in the `pmc` power monitoring condition. If
+ * the mask is non-zero, the current value pointed to by the `pmc->addr` pointer
+ * will be read and compared against the expected value, and if they match, the
+ * entering of optimized power state will be aborted. This is intended to
+ * prevent the CPU from entering optimized power state and waiting on a write
+ * that has already happened by the time this API is called.
  *
  * @warning It is responsibility of the user to check if this function is
  *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
@@ -67,11 +70,14 @@ int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 
 /**
  * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
+ * @b EXPERIMENTAL: this API may change without prior notice.
  *
  * Wake up a specific lcore that is in a power optimized state and is monitoring
  * an address.
  *
+ * @note It is safe to call this function if the lcore in question is not
+ *   sleeping. The function will have no effect.
+ *
  * @note This function will *not* wake up a core that is in a power optimized
  *   state due to calling `rte_power_pause`.
  *
@@ -83,7 +89,7 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
 
 /**
  * @warning
- * @b EXPERIMENTAL: this API may change without prior notice
+ * @b EXPERIMENTAL: this API may change without prior notice.
  *
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v20 3/4] power: add PMD power management API and callback
  2021-01-22 17:12                         ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Anatoly Burakov
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 1/4] eal: rename power monitor condition member Anatoly Burakov
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 2/4] eal: improve comments around power monitoring API Anatoly Burakov
@ 2021-01-22 17:12                           ` Anatoly Burakov
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 4/4] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2021-01-29 14:20                           ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Thomas Monjalon
  4 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-22 17:12 UTC (permalink / raw)
  To: dev; +Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on the address of the next
   RX descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling

   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.
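
For scheme 2, when TPAUSE is not available, the patch pre-calculates how
many rte_pause() calls fit into one microsecond (see calc_tsc() in the
diff below). The arithmetic can be sketched in plain C as follows. The
function name and its parameters are hypothetical: the measured values
are passed in instead of being read from the TSC, so that the sketch has
no DPDK dependency.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate how many pause calls fit into one microsecond.
 *
 * timer_hz:    TSC frequency in Hz
 * elapsed_tsc: TSC ticks measured across n_pauses pause calls
 * n_pauses:    number of pause calls issued during the measurement
 */
static uint64_t
pauses_per_us(uint64_t timer_hz, uint64_t elapsed_tsc, uint32_t n_pauses)
{
	const uint64_t tsc_per_us = timer_hz / 1000000; /* ticks in 1us */
	double us, us_per_pause;

	assert(tsc_per_us != 0 && elapsed_tsc != 0 && n_pauses != 0);

	us = elapsed_tsc / (double)tsc_per_us; /* total measured time */
	us_per_pause = us / n_pauses;          /* cost of a single pause */

	return (uint64_t)(1.0 / us_per_pause);
}
```

For example, with a 2 GHz TSC and 10000 pauses measured at 280000 ticks,
the estimate comes out at 71 pauses per microsecond.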

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v17:
    - Added memory barriers suggested by Konstantin
    - Removed the BUSY state

 doc/guides/prog_guide/power_man.rst    |  41 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 365 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  91 ++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 515 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..f36ba0027c 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+Ethernet PMD Power Management API
+---------------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+  * Monitor
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not available, use ``rte_pause()``.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing ``librte_power`` library
+   functionality to scale the core frequency up/down depending on traffic
+   volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for Ethernet PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core.
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core.
+
 References
 ----------
 
diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst
index ae36b6a3fa..81b9d93cd0 100644
--- a/doc/guides/rel_notes/release_21_02.rst
+++ b/doc/guides/rel_notes/release_21_02.rst
@@ -122,6 +122,16 @@ New Features
   * Added support for aes-cbc sha256-128-hmac cipher combination in OCTEON TX2
     crypto PMD lookaside protocol offload for IPsec.
 
+* **Added Ethernet PMD power management helper API.**
+
+  A new helper API has been added to make using Ethernet PMD power management
+  easier for the user: ``rte_power_ethdev_pmgmt_queue_enable()``. Three power
+  management schemes are supported initially:
+
+  * Power saving based on UMWAIT instruction (x86 only)
+  * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only)
+  * Power saving based on frequency scaling through the ``librte_power`` library
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 4b4cf1b90b..e5a11cb834 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..454ef7091e
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,365 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/* store some internal state */
+static struct pmd_conf_data {
+	/** what do we support? */
+	struct rte_cpu_intrinsics intrinsics_support;
+	/** pre-calculated tsc diff for 1us */
+	uint64_t tsc_per_us;
+	/** how many rte_pause can we fit in a microsecond? */
+	uint64_t pause_per_us;
+} global_data;
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED
+};
+
+struct pmd_queue_cfg {
+	volatile enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	volatile bool umwait_in_progress;
+	/**< are we currently sleeping? */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+static void
+calc_tsc(void)
+{
+	const uint64_t hz = rte_get_timer_hz();
+	const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */
+
+	global_data.tsc_per_us = tsc_per_us;
+
+	/* only do this if we don't have tpause */
+	if (!global_data.intrinsics_support.power_pause) {
+		const uint64_t start = rte_rdtsc_precise();
+		const uint32_t n_pauses = 10000;
+		double us, us_per_pause;
+		uint64_t end;
+		unsigned int i;
+
+		/* estimate number of rte_pause() calls per us*/
+		for (i = 0; i < n_pauses; i++)
+			rte_pause();
+
+		end = rte_rdtsc_precise();
+		us = (end - start) / (double)tsc_per_us;
+		us_per_pause = us / n_pauses;
+
+		global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause);
+	}
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc;
+			uint16_t ret;
+
+			/*
+			 * we might get a cancellation request while being
+			 * inside the callback, in which case the wakeup
+			 * wouldn't work because it would've arrived too early.
+			 *
+			 * to get around this, we notify the other thread that
+			 * we're sleeping, so that it can spin until we're done.
+			 * unsolicited wakeups are perfectly safe.
+			 */
+			q_conf->umwait_in_progress = true;
+
+			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+			/* check if we need to cancel sleep */
+			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+				/* use monitoring condition to sleep */
+				ret = rte_eth_get_monitor_addr(port_id, qidx,
+						&pmc);
+				if (ret == 0)
+					rte_power_monitor(&pmc, -1ULL);
+			}
+			q_conf->umwait_in_progress = false;
+
+			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
+		void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/* use tpause if we have it */
+			if (global_data.intrinsics_support.power_pause) {
+				const uint64_t cur = rte_rdtsc();
+				const uint64_t wait_tsc =
+						cur + global_data.tsc_per_us;
+				rte_power_pause(wait_tsc);
+			} else {
+				uint64_t i;
+				for (i = 0; i < global_data.pause_per_us; i++)
+					rte_pause();
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
+		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	struct rte_eth_dev_info info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_eth_dev_info_get(port_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if queue id is valid */
+	if (queue_id >= info.nb_rx_queues) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		struct rte_power_monitor_cond dummy;
+
+		/* check if rte_power_monitor is supported */
+		if (!global_data.intrinsics_support.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_monitor_addr(port_id, queue_id,
+				&dummy) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->umwait_in_progress = false;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* ensure we update our state before callback starts */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* this is not necessary here, but do it anyway */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		/* this is not necessary here, but do it anyway */
+		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+end:
+	return ret;
+}
+
+int
+rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* stop any callbacks from progressing */
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	/* ensure we update our state before continuing */
+	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+	{
+		bool exit = false;
+		do {
+			/*
+			 * we may request cancellation while the other thread
+			 * has just entered the callback but hasn't started
+			 * sleeping yet, so keep waking it up until we know it's
+			 * done sleeping.
+			 */
+			if (queue_cfg->umwait_in_progress)
+				rte_power_monitor_wakeup(lcore_id);
+			else
+				exit = true;
+		} while (!exit);
+	}
+	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..7a0ac24625
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** Use power-optimized monitoring to wait for incoming traffic */
+	RTE_POWER_MGMT_TYPE_MONITOR = 1,
+	/** Use power-optimized sleep to avoid busy polling */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Use frequency scaling when traffic is low */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Enable power management on a specified Ethernet device Rx queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue will be polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management scheme to use for specified Rx queue.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Disable power management on a specified Ethernet device Rx queue and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..f38a380212 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,9 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+
+	# added in 21.02
+	rte_power_ethdev_pmgmt_queue_disable;
+	rte_power_ethdev_pmgmt_queue_enable;
+
 };
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
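
Editorial sketch (not part of the patch): the three Rx callbacks in the patch above (clb_umwait, clb_pause, clb_scale_freq) share one pattern. They count consecutive empty polls, act only once the count exceeds a threshold, and reset the count as soon as packets arrive. The standalone C below isolates that pattern under stated assumptions: `struct queue_state`, `should_save_power` and the `threshold` parameter are hypothetical names, and the threshold stands in for EMPTYPOLL_MAX, whose value is not shown in this part of the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue state; stands in for struct pmd_queue_cfg. */
struct queue_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Returns true when the caller should take its power-saving action
 * (monitor, pause, or scale down) after this poll.
 * `threshold` plays the role of EMPTYPOLL_MAX; its value is an assumption.
 */
static bool
should_save_power(struct queue_state *q, uint16_t nb_rx, uint64_t threshold)
{
	if (nb_rx != 0) {
		/* traffic arrived: start counting from scratch */
		q->empty_polls = 0;
		return false;
	}
	q->empty_polls++;
	return q->empty_polls > threshold;
}
```

As in the patch, the comparison is strict, so with a threshold of N the action first triggers on empty poll N + 1.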

* [dpdk-dev] [PATCH v20 4/4] examples/l3fwd-power: enable PMD power mgmt
  2021-01-22 17:12                         ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Anatoly Burakov
                                             ` (2 preceding siblings ...)
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 3/4] power: add PMD power management API and callback Anatoly Burakov
@ 2021-01-22 17:12                           ` Anatoly Burakov
  2021-01-29 14:15                             ` Thomas Monjalon
  2021-01-29 14:20                           ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Thomas Monjalon
  4 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-22 17:12 UTC (permalink / raw)
  To: dev; +Cc: Liang Ma, David Hunt, thomas

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v20:
    - Moved Rx callback removal to before port close
    
    v12:
    - Allow selecting PMD power management scheme from command-line
    - Enforce 1 core 1 queue rule

 .../sample_app_ug/l3_forward_power_man.rst    | 35 ++++++++
 examples/l3fwd-power/main.c                   | 90 ++++++++++++++++++-
 2 files changed, 123 insertions(+), 2 deletions(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 85a78a5c1e..aaa9367fae 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode,
+``l3fwd-power`` does simple L3 forwarding and enables a power saving scheme on specific
+port/queue/lcore combinations. Its main purpose is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt=monitor -p 0x0f --config="(0,0,2),(0,1,3)"
+
+PMD Power Management Mode
+-------------------------
+There is also a traffic-aware operating mode that, instead of using explicit
+power management, will use automatic PMD power management. This mode is limited
+to one queue per core, and has three available power management schemes:
+
+* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
+  power-optimized state (subject to platform support).
+
+* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
+  busy looping when there is no traffic.
+
+* ``scale`` - this will use frequency scaling routines available in the
+  ``librte_power`` library.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK
+Programmer's Guide for more details on PMD power management.
+
+.. code-block:: console
+
+        ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 995a3b6ad7..0af8810697 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,11 +200,14 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
 
+static enum rte_power_pmd_mgmt_type pmgmt_type;
+
 enum freq_scale_hint_t
 {
 	FREQ_LOWER    =      -1,
@@ -1611,7 +1615,9 @@ print_usage(const char *prgname)
 		" follow (training_flag, high_threshold, med_threshold)\n"
 		" --telemetry: enable telemetry mode, to update"
 		" empty polls, full polls, and core busyness to telemetry\n"
-		" --interrupt-only: enable interrupt-only mode\n",
+		" --interrupt-only: enable interrupt-only mode\n"
+		" --pmd-mgmt MODE: enable PMD power management mode. "
+		"Currently supported modes: monitor, pause, scale\n",
 		prgname);
 }
 
@@ -1701,6 +1707,32 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+
+static int
+parse_pmd_mgmt_config(const char *name)
+{
+#define PMD_MGMT_MONITOR "monitor"
+#define PMD_MGMT_PAUSE   "pause"
+#define PMD_MGMT_SCALE   "scale"
+
+	if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE;
+		return 0;
+	}
+
+	if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) {
+		pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE;
+		return 0;
+	}
+	/* unknown PMD power management mode */
+	return -1;
+}
+
 static int
 parse_ep_config(const char *q_arg)
 {
@@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 1, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				if (parse_pmd_mgmt_config(optarg) < 0) {
+					printf(" Invalid PMD power management mode: %s\n",
+							optarg);
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2442,6 +2491,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2671,6 +2722,13 @@ main(int argc, char **argv)
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
+
+		/* PMD power management mode can only do 1 queue per core */
+		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
+			rte_exit(EXIT_FAILURE,
+				"In PMD power management mode, only one queue per lcore is allowed\n");
+		}
+
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2708,6 +2766,16 @@ main(int argc, char **argv)
 					rte_exit(EXIT_FAILURE,
 						 "Fail to add ptype cb\n");
 			}
+
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_ethdev_pmgmt_queue_enable(
+						lcore_id, portid, queueid,
+						pmgmt_type);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
+							ret, portid);
+			}
 		}
 	}
 
@@ -2798,6 +2866,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		/* reuse telemetry loop for PMD power management mode */
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2808,6 +2879,21 @@ main(int argc, char **argv)
 			return -1;
 	}
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+
+				rte_power_ethdev_pmgmt_queue_disable(lcore_id,
+						portid, queueid);
+			}
+		}
+	}
+
 	RTE_ETH_FOREACH_DEV(portid)
 	{
 		if ((enabled_port_mask & (1 << portid)) == 0)
-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
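
Editorial sketch (not part of the patch): parse_pmd_mgmt_config() in the patch above maps the --pmd-mgmt argument to a power management type with three separate comparisons. The same mapping could be written as a lookup table, which may be easier to extend if more schemes are added. The enum and function names below are local stand-ins invented for this illustration, not DPDK identifiers.

```c
#include <stddef.h>
#include <string.h>

/* Local stand-in for enum rte_power_pmd_mgmt_type from the patch. */
enum pmd_mgmt_type {
	MGMT_TYPE_MONITOR = 1,
	MGMT_TYPE_PAUSE,
	MGMT_TYPE_SCALE,
};

/* Returns 0 and stores the mode on success, -1 for an unknown name. */
static int
parse_mode(const char *name, enum pmd_mgmt_type *out)
{
	static const struct {
		const char *name;
		enum pmd_mgmt_type type;
	} modes[] = {
		{ "monitor", MGMT_TYPE_MONITOR },
		{ "pause",   MGMT_TYPE_PAUSE },
		{ "scale",   MGMT_TYPE_SCALE },
	};
	size_t i;

	if (name == NULL || out == NULL)
		return -1;
	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
		/* exact match only: "scaled" or "sca" must be rejected */
		if (strcmp(name, modes[i].name) == 0) {
			*out = modes[i].type;
			return 0;
		}
	}
	return -1;
}
```

The patch reaches the same exact-match behaviour with strncmp() and sizeof() on the string literal, because sizeof() counts the terminating NUL byte.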

* Re: [dpdk-dev] [PATCH v20 1/4] eal: rename power monitor condition member
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 1/4] eal: rename power monitor condition member Anatoly Burakov
@ 2021-01-29 11:26                             ` Thomas Monjalon
  0 siblings, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-29 11:26 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, Timothy McDaniel, Beilei Xing, Jeff Guo, Qiming Yang,
	Qi Zhang, Haiyue Wang, Bruce Richardson, Konstantin Ananyev

22/01/2021 18:12, Anatoly Burakov:
> The `data_sz` name is fine, but it looks out of place because nothing
> else has "data" prefix in that structure. Rename it to "size", as well
> as add more clarity to the comments around each struct member.
> 
> Fixes: 6a17919b0e2a ("eal: change power intrinsics API")
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

Thanks for the clarifications.
Acked-by: Thomas Monjalon <thomas@monjalon.net>




^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v20 2/4] eal: improve comments around power monitoring API
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 2/4] eal: improve comments around power monitoring API Anatoly Burakov
@ 2021-01-29 11:27                             ` Thomas Monjalon
  0 siblings, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-29 11:27 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev

22/01/2021 18:12, Anatoly Burakov:
> Currently, the API documentation is ambiguous as to what happens when
> certain conditions are met. Document the behavior explicitly, as well as
> fix some typos and outdated comments.
> 
> Fixes: 6a17919b0e2a ("eal: change power intrinsics API")
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>   * Monitor specific address for changes. This will cause the CPU to enter an
>   * architecture-defined optimized power state until either the specified
>   * memory address is written to, a certain TSC timestamp is reached, or other
>   * reasons cause the CPU to wake up.
>   *
> - * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> - * mask is non-zero, the current value pointed to by the `p` pointer will be
> - * checked against the expected value, and if they match, the entering of
> - * optimized power state may be aborted.
> + * Additionally, an expected value (`pmc->val`), mask (`pmc->mask`), and data
> + * size (`pmc->size`) are provided in the `pmc` power monitoring condition. If
> + * the mask is non-zero, the current value pointed to by the `pmc->addr` pointer
> + * will be read and compared against the expected value, and if they match, the
> + * entering of optimized power state will be aborted. This is intended to
> + * prevent the CPU from entering optimized power state and waiting on a write
> + * that has already happened by the time this API is called.

I think that's a lot better. Thank you.
Acked-by: Thomas Monjalon <thomas@monjalon.net>




^ permalink raw reply	[flat|nested] 421+ messages in thread
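
Editorial sketch (not part of the thread): the revised comment quoted above says that, when the mask is non-zero, the value at `pmc->addr` is read and compared against the expected value, and sleep is aborted on a match. A minimal standalone model of that check follows. `struct monitor_cond` is a hypothetical stand-in for the real condition structure, and applying the mask to the current value before the comparison is an assumption of this sketch, since the quoted comment does not spell that step out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_power_monitor_cond. */
struct monitor_cond {
	const volatile void *addr; /* address to monitor */
	uint64_t val;              /* expected value */
	uint64_t mask;             /* bits to compare; 0 disables the check */
	uint8_t size;              /* width of the read: 1, 2, 4 or 8 bytes */
};

/* Read `size` bytes from `addr`, widened to 64 bits. */
static uint64_t
read_current(const volatile void *addr, uint8_t size)
{
	switch (size) {
	case 1: return *(const volatile uint8_t *)addr;
	case 2: return *(const volatile uint16_t *)addr;
	case 4: return *(const volatile uint32_t *)addr;
	case 8: return *(const volatile uint64_t *)addr;
	default:
		assert(!"unsupported size");
		return 0;
	}
}

/* True if the awaited write already happened, so sleeping would be pointless. */
static bool
sleep_should_abort(const struct monitor_cond *pmc)
{
	if (pmc->mask == 0)
		return false;
	/* assumption: the mask is applied to the current value only */
	return (read_current(pmc->addr, pmc->size) & pmc->mask) == pmc->val;
}
```

The point of the check, as the comment explains, is to avoid waiting on a write that landed before the monitor was armed.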

* Re: [dpdk-dev] [PATCH v20 4/4] examples/l3fwd-power: enable PMD power mgmt
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 4/4] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-29 14:15                             ` Thomas Monjalon
  0 siblings, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-29 14:15 UTC (permalink / raw)
  To: Liang Ma, Anatoly Burakov; +Cc: dev, David Hunt, bruce.richardson

22/01/2021 18:12, Anatoly Burakov:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add PMD power management feature support to l3fwd-power sample app.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: David Hunt <david.hunt@intel.com>
> ---
[...]
> +PMD Power Management Mode
> +-------------------------

There should be a blank line here.

> +There is also a traffic-aware operating mode that, instead of using explicit
> +power management, will use automatic PMD power management. This mode is limited
> +to one queue per core, and has three available power management schemes:
> +
> +* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a
> +  power-optimized state (subject to platform support).
> +
> +* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid
> +  busy looping when there is no traffic.
> +
> +* ``scale`` - this will use frequency scaling routines available in the
> +  ``librte_power`` library.

Better to use a definition list for such explanations:
https://docutils.sourceforge.io/docs/user/rst/quickref.html#definition-lists

I will update while merging.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v20 0/4] Add PMD power management
  2021-01-22 17:12                         ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Anatoly Burakov
                                             ` (3 preceding siblings ...)
  2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 4/4] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2021-01-29 14:20                           ` Thomas Monjalon
  2021-01-29 14:47                             ` Burakov, Anatoly
  4 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-29 14:20 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev, liang.j.ma, bruce.richardson, david.hunt

22/01/2021 18:12, Anatoly Burakov:
> Things of note:
> 
> - Only 1:1 core to queue mapping is supported, meaning that each lcore 
>   must at most handle RX on a single queue

Is there a way to have a more generic solution?
I think it may deserve some comments in the API.

> - Support 3 type policies. Monitor/Pause/Frequency Scaling
> - Power management is enabled per-queue
> - The API doesn't extend to other device types

Could it be extended to more device types?
Otherwise it should be called specifically ethdev power management.

> Anatoly Burakov (2):
>   eal: rename power monitor condition member
>   eal: improve comments around power monitoring API
> 
> Liang Ma (2):
>   power: add PMD power management API and callback
>   examples/l3fwd-power: enable PMD power mgmt

Applied with a few formatting improvements, thanks



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v20 0/4] Add PMD power management
  2021-01-29 14:20                           ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Thomas Monjalon
@ 2021-01-29 14:47                             ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-29 14:47 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, liang.j.ma, bruce.richardson, david.hunt

On 29-Jan-21 2:20 PM, Thomas Monjalon wrote:
> 22/01/2021 18:12, Anatoly Burakov:
>> Things of note:
>>
>> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>>    must at most handle RX on a single queue
> 
> Is there a way to have a more generic solution?
> I think it may deserve some comments in the API.

If you're referring to the possibility of monitoring multiple queues from 
one core, we are investigating ways to make that happen, but for now, 
this is the limitation.

> 
>> - Support 3 type policies. Monitor/Pause/Frequency Scaling
>> - Power management is enabled per-queue
>> - The API doesn't extend to other device types
> 
> Could it be extended to more device types?
> Otherwise it should be called specifically ethdev power management.

It can theoretically be extended to any device type that has callbacks. 
The current focus is obviously NICs, but in general, it doesn't have to be. 
Anything that polls and has callbacks should work.

> 
>> Anatoly Burakov (2):
>>    eal: rename power monitor condition member
>>    eal: improve comments around power monitoring API
>>
>> Liang Ma (2):
>>    power: add PMD power management API and callback
>>    examples/l3fwd-power: enable PMD power mgmt
> 
> Applied with a few formatting improvements, thanks
> 
> 

Thanks!

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

end of thread, other threads:[~2021-01-29 14:47 UTC | newest]

Thread overview: 421+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
2020-05-28 11:39   ` Ananyev, Konstantin
2020-05-28 14:40     ` Burakov, Anatoly
2020-05-28 14:58       ` Bruce Richardson
2020-05-28 15:38       ` Ananyev, Konstantin
2020-05-29  6:56         ` Jerin Jacob
2020-06-02 10:15           ` Ananyev, Konstantin
2020-06-03  6:22           ` Honnappa Nagarahalli
2020-06-03  6:31             ` Jerin Jacob
2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma
2020-11-02 12:23       ` Burakov, Anatoly
2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 2/6] power: add PMD power management API and callback Liang Ma
2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 3/6] net/ixgbe: implement power management API Liang Ma
2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 4/6] net/i40e: " Liang Ma
2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 5/6] net/ice: " Liang Ma
2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 6/6] examples/l3fwd-power: enable PMD power mgmt Liang Ma
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 00/11] Add PMD power management Anatoly Burakov
2020-12-17 16:12       ` David Marchand
2021-01-08 16:42         ` Burakov, Anatoly
2021-01-11  8:44           ` David Marchand
2021-01-11  8:52             ` David Marchand
2021-01-11 10:21               ` Burakov, Anatoly
2021-01-08 17:42       ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov
2021-01-12 15:54           ` Ananyev, Konstantin
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov
2021-01-08 19:58           ` Stephen Hemminger
2021-01-11 10:21             ` Burakov, Anatoly
2021-01-12 15:56           ` Ananyev, Konstantin
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 03/11] eal: change API of " Anatoly Burakov
2021-01-12 15:58           ` Ananyev, Konstantin
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor Anatoly Burakov
2021-01-12 15:59           ` Ananyev, Konstantin
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function Anatoly Burakov
2021-01-12 16:02           ` Ananyev, Konstantin
2021-01-12 16:18             ` Burakov, Anatoly
2021-01-12 16:25               ` Burakov, Anatoly
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API Anatoly Burakov
2021-01-09  8:04           ` Andrew Rybchenko
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 07/11] power: add PMD power management API and callback Anatoly Burakov
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 08/11] net/ixgbe: implement power management API Anatoly Burakov
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 09/11] net/i40e: " Anatoly Burakov
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 10/11] net/ice: " Anatoly Burakov
2021-01-08 17:42         ` [dpdk-dev] [PATCH v13 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2021-01-11 14:35         ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 02/11] eal: avoid invalid API usage in " Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 03/11] eal: change API of " Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 04/11] eal: remove sync version of power monitor Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 05/11] eal: add monitor wakeup function Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 06/11] ethdev: add simple power management API Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 07/11] power: add PMD power management API and callback Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 08/11] net/ixgbe: implement power management API Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 09/11] net/i40e: " Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 10/11] net/ice: " Anatoly Burakov
2021-01-11 14:35           ` [dpdk-dev] [PATCH v14 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2021-01-11 14:58           ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov
2021-01-12 16:09               ` Ananyev, Konstantin
2021-01-12 16:14                 ` Burakov, Anatoly
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 02/11] eal: avoid invalid API usage in " Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 03/11] eal: change API of " Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 04/11] eal: remove sync version of power monitor Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 05/11] eal: add monitor wakeup function Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 06/11] ethdev: add simple power management API Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 07/11] power: add PMD power management API and callback Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 08/11] net/ixgbe: implement power management API Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 09/11] net/i40e: " Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 10/11] net/ice: " Anatoly Burakov
2021-01-11 14:58             ` [dpdk-dev] [PATCH v15 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2021-01-12 17:37             ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics Anatoly Burakov
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 02/11] eal: avoid invalid API usage in " Anatoly Burakov
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 03/11] eal: change API of " Anatoly Burakov
2021-01-13 13:01                 ` Ananyev, Konstantin
2021-01-13 17:22                   ` Burakov, Anatoly
2021-01-13 18:01                     ` Ananyev, Konstantin
2021-01-14 10:23                       ` Burakov, Anatoly
2021-01-14 12:33                         ` Ananyev, Konstantin
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 04/11] eal: remove sync version of power monitor Anatoly Burakov
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function Anatoly Burakov
2021-01-13 12:46                 ` Ananyev, Konstantin
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API Anatoly Burakov
2021-01-13 13:18                 ` Ananyev, Konstantin
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback Anatoly Burakov
2021-01-13 12:58                 ` Ananyev, Konstantin
2021-01-13 17:29                   ` Burakov, Anatoly
2021-01-14 13:00                     ` Burakov, Anatoly
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 08/11] net/ixgbe: implement power management API Anatoly Burakov
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 09/11] net/i40e: " Anatoly Burakov
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 10/11] net/ice: " Anatoly Burakov
2021-01-12 17:37               ` [dpdk-dev] [PATCH v16 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2021-01-14  9:36               ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management David Marchand
2021-01-14 10:25                 ` Burakov, Anatoly
2021-01-14 14:46               ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 02/11] eal: avoid invalid API usage in " Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 03/11] eal: change API of " Anatoly Burakov
2021-01-18 22:26                   ` Thomas Monjalon
2021-01-19 10:29                     ` Burakov, Anatoly
2021-01-19 10:42                       ` Thomas Monjalon
2021-01-19 11:23                         ` Burakov, Anatoly
2021-01-19 14:17                           ` Thomas Monjalon
2021-01-20 10:32                             ` Burakov, Anatoly
2021-01-20 10:38                               ` Thomas Monjalon
2021-01-20 11:05                                 ` Burakov, Anatoly
2021-01-20 11:11                                   ` Thomas Monjalon
2021-01-20 11:17                                     ` Burakov, Anatoly
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 04/11] eal: remove sync version of power monitor Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 05/11] eal: add monitor wakeup function Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 06/11] ethdev: add simple power management API Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Anatoly Burakov
2021-01-18 12:41                   ` David Hunt
2021-01-19 16:45                     ` [dpdk-dev] [PATCH v18 0/2] Add PMD power management Anatoly Burakov
2021-01-19 16:45                       ` [dpdk-dev] [PATCH v18 1/2] power: add PMD power management API and callback Anatoly Burakov
2021-01-19 16:45                       ` [dpdk-dev] [PATCH v18 2/2] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2021-01-20 11:50                       ` [dpdk-dev] [PATCH v19 0/4] Add PMD power management Anatoly Burakov
2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 1/4] eal: rename power monitor condition member Anatoly Burakov
2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 2/4] eal: improve comments around power monitoring API Anatoly Burakov
2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 3/4] power: add PMD power management API and callback Anatoly Burakov
2021-01-20 11:50                         ` [dpdk-dev] [PATCH v19 4/4] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2021-01-22 17:12                         ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Anatoly Burakov
2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 1/4] eal: rename power monitor condition member Anatoly Burakov
2021-01-29 11:26                             ` Thomas Monjalon
2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 2/4] eal: improve comments around power monitoring API Anatoly Burakov
2021-01-29 11:27                             ` Thomas Monjalon
2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 3/4] power: add PMD power management API and callback Anatoly Burakov
2021-01-22 17:12                           ` [dpdk-dev] [PATCH v20 4/4] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2021-01-29 14:15                             ` Thomas Monjalon
2021-01-29 14:20                           ` [dpdk-dev] [PATCH v20 0/4] Add PMD power management Thomas Monjalon
2021-01-29 14:47                             ` Burakov, Anatoly
2021-01-18 22:48                   ` [dpdk-dev] [PATCH v17 07/11] power: add PMD power management API and callback Thomas Monjalon
2021-01-19 12:25                     ` Burakov, Anatoly
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 08/11] net/ixgbe: implement power management API Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 09/11] net/i40e: " Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 10/11] net/ice: " Anatoly Burakov
2021-01-14 14:46                 ` [dpdk-dev] [PATCH v17 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2021-01-18 12:53                   ` David Hunt
2021-01-18 15:24                 ` [dpdk-dev] [PATCH v17 00/11] Add PMD power management David Marchand
2021-01-18 15:45                   ` Burakov, Anatoly
2021-01-18 16:06                     ` Thomas Monjalon
2021-01-18 17:02                       ` Burakov, Anatoly
2021-01-18 17:54                         ` David Marchand
2021-01-18 22:52                 ` Thomas Monjalon
2021-01-19 10:30                   ` Burakov, Anatoly
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 01/11] eal: uninline power intrinsics Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 02/11] eal: avoid invalid API usage in " Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 03/11] eal: change API of " Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 04/11] eal: remove sync version of power monitor Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 05/11] eal: add monitor wakeup function Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 06/11] ethdev: add simple power management API Anatoly Burakov
2020-12-28 11:00       ` Andrew Rybchenko
2021-01-08 16:30         ` Burakov, Anatoly
2021-01-12 20:32       ` Lance Richardson
2021-01-13 13:04         ` Burakov, Anatoly
2021-01-13 13:25           ` Ananyev, Konstantin
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 07/11] power: add PMD power management API and callback Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 08/11] net/ixgbe: implement power management API Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 09/11] net/i40e: " Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 10/11] net/ice: " Anatoly Burakov
2020-12-17 14:05     ` [dpdk-dev] [PATCH v12 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
2020-05-28 12:15   ` Ananyev, Konstantin
2020-05-27 17:02 ` [dpdk-dev] [RFC 3/6] net/ixgbe: implement " Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 4/6] net/i40e: " Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 5/6] net/ice: " Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port Anatoly Burakov
2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
2020-05-27 20:57   ` Stephen Hemminger
2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
2020-08-13 18:11     ` Liang, Ma
2020-08-11 10:27   ` [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API Liang Ma
2020-08-11 10:27   ` [dpdk-dev] [RFC v2 4/5] net/i40e: " Liang Ma
2020-08-11 10:27   ` [dpdk-dev] [RFC v2 5/5] net/ice: " Liang Ma
2020-08-13 18:04   ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang, Ma
2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: " Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 6/6] net/ice: " Liang Ma
2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
2020-09-04 16:37       ` Stephen Hemminger
2020-09-14 21:04         ` Liang, Ma
2020-09-04 20:54       ` Ananyev, Konstantin
2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
2020-09-04 16:36       ` Stephen Hemminger
2020-09-14 20:52         ` Liang, Ma
2020-09-04 18:33       ` Ananyev, Konstantin
2020-09-14 21:01         ` Liang, Ma
2020-09-16 14:53           ` Ananyev, Konstantin
2020-09-16 16:39             ` Liang, Ma
2020-09-16 16:44               ` Ananyev, Konstantin
2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 5/6] net/i40e: " Liang Ma
2020-09-04 10:19     ` [dpdk-dev] [PATCH v3 6/6] net/ice: " Liang Ma
2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
2020-09-14 20:48       ` Liang, Ma
2020-09-04 16:37     ` Stephen Hemminger
2020-09-14 20:49       ` Liang, Ma
2020-09-04 18:42     ` Stephen Hemminger
2020-09-14 21:12       ` Liang, Ma
2020-09-16 16:34       ` Liang, Ma
2020-09-06 21:44     ` Ananyev, Konstantin
2020-09-18  5:01     ` Jerin Jacob
2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
2020-10-08  8:33         ` Thomas Monjalon
2020-10-08  8:44           ` Jerin Jacob
2020-10-08  9:41             ` Thomas Monjalon
2020-10-08 13:26             ` Burakov, Anatoly
2020-10-08 15:13               ` Jerin Jacob
2020-10-08 17:07                 ` Ananyev, Konstantin
2020-10-09  5:42                   ` Jerin Jacob
2020-10-09  9:25                     ` Burakov, Anatoly
2020-10-09  9:29                       ` Thomas Monjalon
2020-10-09  9:40                         ` Burakov, Anatoly
2020-10-09  9:54                           ` Jerin Jacob
2020-10-09 10:03                             ` Burakov, Anatoly
2020-10-09 10:17                               ` Thomas Monjalon
2020-10-09 10:22                                 ` Burakov, Anatoly
2020-10-09 10:45                                   ` Jerin Jacob
2020-10-09 10:48                                   ` Ananyev, Konstantin
2020-10-09 11:12                                     ` Burakov, Anatoly
2020-10-09 11:36                                       ` Bruce Richardson
2020-10-09 11:42                                         ` Burakov, Anatoly
2020-10-09 10:19                               ` Jerin Jacob
2020-10-08 17:15         ` Ananyev, Konstantin
2020-10-09  9:11           ` Burakov, Anatoly
2020-10-09 15:39             ` Ananyev, Konstantin
2020-10-09 16:10               ` Burakov, Anatoly
2020-10-09 16:56                 ` Ananyev, Konstantin
2020-10-09 16:59                   ` Burakov, Anatoly
2020-10-10 13:19                     ` Ananyev, Konstantin
2020-10-12 10:35                       ` Burakov, Anatoly
2020-10-12 10:36                         ` Burakov, Anatoly
2020-10-12 12:50                           ` Ananyev, Konstantin
2020-10-12 13:13                             ` Burakov, Anatoly
2020-10-13  9:45                               ` Burakov, Anatoly
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
2020-10-08  8:46         ` Thomas Monjalon
2020-10-08 11:39           ` Ananyev, Konstantin
2020-10-08 22:26         ` Ananyev, Konstantin
2020-10-09 16:11           ` Burakov, Anatoly
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback Liang Ma
2020-10-09 16:38         ` Ananyev, Konstantin
2020-10-09 16:47           ` Burakov, Anatoly
2020-10-09 16:51             ` Ananyev, Konstantin
2020-10-09 16:56               ` Burakov, Anatoly
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API Liang Ma
2020-10-09 15:53         ` Ananyev, Konstantin
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 06/10] net/i40e: " Liang Ma
2020-10-09 16:01         ` Ananyev, Konstantin
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 07/10] net/ice: " Liang Ma
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 08/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 09/10] doc: update release notes for PMD power management Liang Ma
2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 10/10] doc: update the programming guide " Liang Ma
2020-10-02 14:44       ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Bruce Richardson
2020-10-08 22:08       ` Ananyev, Konstantin
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
2020-10-09 17:06         ` Burakov, Anatoly
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 " Anatoly Burakov
2020-10-23 17:17             ` [dpdk-dev] [PATCH v8 " Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 00/10] Add PMD power mgmt Liang Ma
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 0/9] " Liang Ma
2020-10-27 16:02                   ` Thomas Monjalon
2020-10-28 13:35                     ` Liang, Ma
2020-10-28 13:49                       ` Jerin Jacob
2020-10-28 14:21                         ` Thomas Monjalon
2020-10-28 14:57                           ` Ananyev, Konstantin
2020-10-28 15:14                             ` Jerin Jacob
2020-10-28 15:30                               ` Liang, Ma
2020-10-28 15:36                                 ` Jerin Jacob
2020-10-28 15:44                                   ` Liang, Ma
2020-10-28 16:01                                     ` Jerin Jacob
2020-10-28 16:21                                       ` Liang, Ma
2020-10-28 15:33                               ` Ananyev, Konstantin
2020-10-28 15:39                                 ` Jerin Jacob
2020-10-28 15:49                                   ` Ananyev, Konstantin
2020-10-28 15:57                                     ` Jerin Jacob
2020-10-28 16:38                                       ` Ananyev, Konstantin
2020-10-28 16:47                                       ` Liang, Ma
2020-10-28 16:54                                         ` McDaniel, Timothy
2020-10-29  9:19                                           ` Liang, Ma
2020-10-28 17:02                                         ` Ajit Khaparde
2020-10-28 18:10                                           ` Ananyev, Konstantin
2020-10-27 20:53                   ` Ajit Khaparde
2020-10-28 12:13                     ` Liang, Ma
2020-10-29 17:42                   ` Thomas Monjalon
2020-10-30  9:36                     ` Liang, Ma
2020-10-30  9:58                       ` Thomas Monjalon
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 1/9] eal: add new x86 cpuid support for WAITPKG Liang Ma
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 2/9] eal: add power management intrinsics Liang Ma
2020-10-29 17:39                   ` Thomas Monjalon
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 3/9] eal: add intrinsics support check infrastructure Liang Ma
2020-10-29 21:27                   ` David Marchand
2020-10-30 10:09                     ` Burakov, Anatoly
2020-10-30 10:14                       ` Thomas Monjalon
2020-10-30 13:37                         ` Burakov, Anatoly
2020-10-30 14:09                           ` Thomas Monjalon
2020-10-30 15:27                             ` Burakov, Anatoly
2020-10-30 15:44                               ` Thomas Monjalon
2020-10-30 16:36                                 ` Burakov, Anatoly
2020-10-30 16:50                                   ` Thomas Monjalon
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 4/9] ethdev: add simple power management API Liang Ma
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 5/9] power: add PMD power management API and callback Liang Ma
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 6/9] net/ixgbe: implement power management API Liang Ma
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 7/9] net/i40e: " Liang Ma
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 8/9] net/ice: " Liang Ma
2020-10-27 14:59                 ` [dpdk-dev] [PATCH v10 9/9] examples/l3fwd-power: enable PMD power mgmt Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 02/10] eal: add power management intrinsics Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 03/10] eal: add intrinsics support check infrastructure Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 04/10] ethdev: add simple power management API Liang Ma
2020-10-24 20:39                 ` Thomas Monjalon
2020-10-27 11:15                   ` Liang, Ma
2020-10-27 15:52                     ` Thomas Monjalon
2020-10-27 17:43                       ` Ananyev, Konstantin
2020-10-27 18:30                         ` Thomas Monjalon
2020-10-27 23:29                           ` Ananyev, Konstantin
2020-10-28  3:24                             ` Ajit Khaparde
2020-10-28 12:24                             ` Liang, Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 05/10] power: add PMD power management API and callback Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 06/10] net/ixgbe: implement power management API Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 07/10] net/i40e: " Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 08/10] net/ice: " Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 09/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
2020-10-23 23:06               ` [dpdk-dev] [PATCH v9 10/10] doc: update programmer's guide for power library Liang Ma
2020-10-24 20:49                 ` Thomas Monjalon
2020-10-27 11:04                   ` Liang, Ma
2020-10-23 17:20             ` [dpdk-dev] [PATCH v8 02/10] eal: add power management intrinsics Liang Ma
2020-10-23 17:21             ` [dpdk-dev] [PATCH v8 03/10] eal: add intrinsics support check infrastructure Liang Ma
2020-10-23 17:22             ` [dpdk-dev] [PATCH v8 04/10] ethdev: add simple power management API Liang Ma
2020-10-23 17:23             ` [dpdk-dev] [PATCH v8 05/10] power: add PMD power management API and callback Liang Ma
2020-10-23 17:26             ` [dpdk-dev] [PATCH v8 06/10] net/ixgbe: implement power management API Liang Ma
2020-10-23 17:26             ` [dpdk-dev] [PATCH v8 07/10] net/i40e: " Liang Ma
2020-10-23 17:27             ` [dpdk-dev] [PATCH v8 08/10] net/ice: " Liang Ma
2020-10-23 17:30             ` [dpdk-dev] [PATCH v8 09/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
2020-10-23 17:36             ` [dpdk-dev] [PATCH v8 10/10] doc: update programmer's guide for power library Liang Ma
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 02/10] eal: add power management intrinsics Anatoly Burakov
2020-10-15 12:06             ` Jerin Jacob
2020-10-15 13:16             ` Ferruh Yigit
2020-10-16  8:44               ` Ruifeng Wang
2020-10-15 16:43             ` Ananyev, Konstantin
2020-10-19 21:12             ` Thomas Monjalon
2020-10-20  2:49               ` Ruifeng Wang
2020-10-20  7:35                 ` Thomas Monjalon
2020-10-20 14:01                   ` David Hunt
2020-10-20 14:17                     ` David Hunt
2020-10-20 14:33                       ` Thomas Monjalon
2020-10-20 17:26                         ` Ananyev, Konstantin
2020-10-20 19:28                           ` Thomas Monjalon
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
2020-10-16  9:02             ` Ruifeng Wang
2020-10-16 11:21             ` Kinsella, Ray
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 04/10] ethdev: add simple power management API Anatoly Burakov
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 05/10] power: add PMD power management API and callback Anatoly Burakov
2020-10-15 16:52             ` Ananyev, Konstantin
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 06/10] net/ixgbe: implement power management API Anatoly Burakov
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 07/10] net/i40e: " Anatoly Burakov
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 08/10] net/ice: " Anatoly Burakov
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2020-10-15 12:04           ` [dpdk-dev] [PATCH v7 10/10] doc: update programmer's guide for power library Anatoly Burakov
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 02/10] eal: add power management intrinsics Anatoly Burakov
2020-10-14 17:48           ` Ananyev, Konstantin
2020-10-15 10:09             ` Burakov, Anatoly
2020-10-15 10:45               ` Burakov, Anatoly
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
2020-10-14 17:51           ` Ananyev, Konstantin
2020-10-14 17:59           ` Jerin Jacob
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 04/10] ethdev: add simple power management API Anatoly Burakov
2020-10-14 17:06           ` Ananyev, Konstantin
2020-10-15 11:29           ` Liang, Ma
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback Anatoly Burakov
2020-10-14 14:19           ` David Hunt
2020-10-14 18:41           ` Ananyev, Konstantin
2020-10-15 10:31             ` Burakov, Anatoly
2020-10-15 16:02               ` Ananyev, Konstantin
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 06/10] net/ixgbe: implement power management API Anatoly Burakov
2020-10-14 17:04           ` Wang, Haiyue
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 07/10] net/i40e: " Anatoly Burakov
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 08/10] net/ice: " Anatoly Burakov
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2020-10-14 14:24           ` David Hunt
2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 10/10] doc: update programmer's guide for power library Anatoly Burakov
2020-10-14 14:27           ` David Hunt
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
2020-10-09 16:09         ` Jerin Jacob
2020-10-09 16:24           ` Burakov, Anatoly
2020-10-12 19:47         ` David Christensen
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
2020-10-11 10:07         ` Jerin Jacob
2020-10-12  9:26           ` Burakov, Anatoly
2020-10-12 19:52         ` David Christensen
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API Anatoly Burakov
2020-10-14  3:10         ` Guo, Jia
2020-10-14  9:07           ` Burakov, Anatoly
2020-10-14  9:15             ` Guo, Jia
2020-10-14  9:30               ` Burakov, Anatoly
2020-10-14  9:23             ` Bruce Richardson
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 05/10] power: add PMD power management API and callback Anatoly Burakov
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
2020-10-12  7:46         ` Wang, Haiyue
2020-10-12  9:28           ` Burakov, Anatoly
2020-10-12  9:44           ` Burakov, Anatoly
2020-10-12 15:58             ` Wang, Haiyue
2020-10-12  8:09         ` Wang, Haiyue
2020-10-12  9:28           ` Burakov, Anatoly
2020-10-13  1:17             ` Wang, Haiyue
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 07/10] net/i40e: " Anatoly Burakov
2020-10-14  3:19         ` Guo, Jia
2020-10-14  9:08           ` Burakov, Anatoly
2020-10-14  9:17             ` Guo, Jia
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 08/10] net/ice: " Anatoly Burakov
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 10/10] doc: update programmer's guide for power library Anatoly Burakov
