DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
@ 2020-05-27 17:02 Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
                   ` (7 more replies)
  0 siblings, 8 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, liang.j.ma

This patchset proposes a simple API for Ethernet drivers
to cause the CPU to enter a power-optimized state while
waiting for packets to arrive, along with a set of
(hopefully generic) intrinsics that facilitate that. This
is achieved through cooperation with the NIC driver that
will allow us to know address of the next NIC RX ring
packet descriptor, and wait for writes on it.

On IA, this is achieved through using UMONITOR/UMWAIT
instructions. They are used in their raw opcode form
because there is no widespread compiler support for
them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen
to implement similar instructions.

To achieve power savings, there is a very simple mechanism
used: we're counting empty polls, and if a certain threshold
is reached, we get the address of next RX ring descriptor
from the NIC driver, arm the monitoring hardware, and
enter a power-optimized state. We will then wake up when
either a timeout happens, or a write happens (or generally
whenever CPU feels like waking up - this is platform-
specific), and proceed as normal. The empty poll counter is
reset whenever we actually get packets, so we only go to
sleep when we know nothing is going on.

Why are we putting it into ethdev as opposed to leaving
this up to the application? Our customers specifically
requested a way to do it wit minimal changes to the
application code. The current approach allows to just
flip a switch and automagically have power savings.

There are certain limitations in this patchset right now:
- Currently, only 1:1 core to queue mapping is supported,
  meaning that each lcore must at most handle RX on a
  single queue
- Currently, power management is enabled per-port, not
  per-queue
- There is potential to greatly increase TX latency if we
  are buffering things, and go to sleep before sending
  packets
- The API is not perfect and could use some improvement
  and discussion
- The API doesn't extend to other device types
- The intrinsics are platform-specific, so ethdev has
  some platform-specific code in it
- Support was only implemented for devices using
  net/ixgbe, net/i40e and net/ice drivers

Hopefully this would generate enough feedback to clear
a path forward!

Anatoly Burakov (6):
  eal: add power management intrinsics
  ethdev: add simple power management API
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  app/testpmd: add command for power management on a port

 app/test-pmd/cmdline.c                        |  48 +++++++
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  23 +++
 drivers/net/i40e/i40e_rxtx.h                  |   2 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  23 +++
 drivers/net/ice/ice_rxtx.h                    |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  22 +++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 lib/librte_ethdev/rte_ethdev.c                |  39 +++++
 lib/librte_ethdev/rte_ethdev.h                |  70 +++++++++
 lib/librte_ethdev/rte_ethdev_core.h           |  41 +++++-
 lib/librte_ethdev/rte_ethdev_version.map      |   4 +
 20 files changed, 480 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-28 11:39   ` Ananyev, Konstantin
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Konstantin Ananyev, david.hunt, liang.j.ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 203 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..8646c4ac16
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file define APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index bc73ec2c5c..b54a2be4f6 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -59,6 +59,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..94d6a43763 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
 	RTE_CPUFLAG_EM64T,                  /**< EM64T */
 
+	RTE_CPUFLAG_WAITPKG,                 /**< UMINITOR/UMWAIT/TPAUSE */
 	/* (EAX 80000007h) EDX features */
 	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
 
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a0522400fb
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+	rte_mb();
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC 2/6] ethdev: add simple power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-28 12:15   ` Ananyev, Konstantin
  2020-05-27 17:02 ` [dpdk-dev] [RFC 3/6] net/ixgbe: implement " Anatoly Burakov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev
  Cc: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella,
	Neil Horman, david.hunt, liang.j.ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will either wait
until a TSC timestamp expires, or when packets arrive.

This API is limited to 1 core 1 queue use case as there is no
coordination between queues/cores in ethdev.

The TSC timestamp is automatically calculated using current link
speed and RX descriptor ring size, such that the sleep time is
not longer than it would take for a NIC to fill its entire RX
descriptor ring.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.c           | 39 +++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 70 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h      | 41 +++++++++++++-
 lib/librte_ethdev/rte_ethdev_version.map |  4 ++
 4 files changed, 152 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 8e10a6fc36..0be5ecfc11 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -16,6 +16,7 @@
 #include <netinet/in.h>
 
 #include <rte_byteorder.h>
+#include <rte_cpuflags.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_interrupts.h>
@@ -5053,6 +5054,44 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
 	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
 }
 
+int
+rte_eth_dev_power_mgmt_enable(uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+		return -ENOTSUP;
+
+	/* allocate memory for empty poll stats */
+	dev->empty_poll_stats = rte_malloc_socket(NULL,
+		sizeof(struct rte_eth_ep_stat) * RTE_MAX_QUEUES_PER_PORT,
+		0, dev->data->numa_node);
+
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_eth_dev_power_mgmt_disable(uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* rte_free ignores NULL so safe to call without checks */
+	rte_free(dev->empty_poll_stats);
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	return 0;
+}
+
 /**
  * A set of values to describe the possible states of a switch domain.
  */
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index a49242bcd2..b8318f7e91 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -666,6 +667,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1490,6 +1492,16 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+/**
+ * Possible power managment states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
@@ -4302,6 +4314,38 @@ __rte_experimental
 int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
 				       struct rte_eth_hairpin_cap *cap);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_enable(uint16_t port_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_disable(uint16_t port_id);
+
 #include <rte_ethdev_core.h>
 
 /**
@@ -4417,6 +4461,32 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 		} while (cb != NULL);
 	}
 #endif
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[queue_id].num++;
+			if (unlikely(dev->empty_poll_stats[queue_id].num >
+					ETH_EMPTYPOLL_MAX)) {
+				volatile void *target_addr;
+				uint64_t expected, mask;
+				int ret;
+
+				/*
+				 * get address of next descriptor in the RX
+				 * ring for this queue, as well as expected
+				 * value and a mask.
+				 */
+				ret = (*dev->dev_ops->next_rx_desc)
+					(dev->data->rx_queues[queue_id],
+					 &target_addr, &expected, &mask);
+				if (ret == 0)
+					/* -1ULL is maximum value for TSC */
+					rte_power_monitor(target_addr,
+							  expected, mask,
+							  0, -1ULL);
+			}
+		} else
+			dev->empty_poll_stats[queue_id].num = 0;
+	}
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
 	return nb_rx;
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..4e23d465f0 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   the pointer point to descriptor address var.
+ *
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +773,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +838,14 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/**< Empty poll number */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	uint32_t reserved_32;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/**< Flag indicating the port power state */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	void *reserved_ptrs[3];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 7155056045..141361823d 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -241,4 +241,8 @@ EXPERIMENTAL {
 	__rte_ethdev_trace_rx_burst;
 	__rte_ethdev_trace_tx_burst;
 	rte_flow_get_aged_flows;
+
+	# added in 20.08
+	rte_eth_dev_power_mgmt_disable;
+	rte_eth_dev_power_mgmt_enable;
 };
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC 3/6] net/ixgbe: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 4/6] net/i40e: " Anatoly Burakov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Wei Zhao, Jeff Guo, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index a4e5c539de..190d11d98d 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -605,6 +605,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 2e20e18c7a..ef2fb5fca9 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 20a8b291d4..6c35966c78 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC 4/6] net/i40e: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (2 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 3/6] net/ixgbe: implement " Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 5/6] net/ice: " Anatoly Burakov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Beilei Xing, Jeff Guo, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 749d85f544..f3ce54911b 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -526,6 +526,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5e7c86ed82..76dfbb2098 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 8f11f011a7..72d810475b 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -245,6 +245,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *value);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC 5/6] net/ice: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (3 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 4/6] net/i40e: " Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port Anatoly Burakov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Qiming Yang, Wenzhuo Lu, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d5110c4392..db8269a548 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -219,6 +219,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 1c9f31efdf..80fd6bd134 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (4 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 5/6] net/ice: " Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  7 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu, Beilei Xing, Bernard Iremonger, david.hunt, liang.j.ma

A quick-and-dirty testpmd command to enable power management on
a specific port.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 app/test-pmd/cmdline.c | 48 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 996a498768..e3a5e19485 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -1773,6 +1773,53 @@ cmdline_parse_inst_t cmd_config_speed_specific = {
 	},
 };
 
+/* *** enable power management for specific port *** */
+struct cmd_port_pmgmt {
+	cmdline_fixed_string_t port;
+	portid_t id;
+	cmdline_fixed_string_t pmgmt;
+	cmdline_fixed_string_t on;
+};
+
+static void
+cmd_port_pmgmt_parsed(void *parsed_result,
+				__rte_unused struct cmdline *cl,
+				__rte_unused void *data)
+{
+	struct cmd_port_pmgmt *res = parsed_result;
+
+	if (port_id_is_invalid(res->id, ENABLED_WARN))
+		return;
+
+	if (!strcmp(res->on, "on"))
+		rte_eth_dev_power_mgmt_enable(res->id);
+	else if (!strcmp(res->on, "off"))
+		rte_eth_dev_power_mgmt_disable(res->id);
+}
+
+
+cmdline_parse_token_string_t cmd_port_pmgmt_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, port, "port");
+cmdline_parse_token_num_t cmd_port_pmgmt_id =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_pmgmt, id, UINT16);
+cmdline_parse_token_string_t cmd_port_pmgmt_item1 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, pmgmt, "power-mgmt");
+cmdline_parse_token_string_t cmd_port_pmgmt_value1 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, on, "on#off");
+
+cmdline_parse_inst_t cmd_port_pmgmt = {
+	.f = cmd_port_pmgmt_parsed,
+	.data = NULL,
+	.help_str = "port <port_id> power-mgmt on|off",
+	.tokens = {
+		(void *)&cmd_port_pmgmt_port,
+		(void *)&cmd_port_pmgmt_id,
+		(void *)&cmd_port_pmgmt_item1,
+		(void *)&cmd_port_pmgmt_value1,
+		NULL,
+	},
+};
+
 /* *** configure loopback for all ports *** */
 struct cmd_config_loopback_all {
 	cmdline_fixed_string_t port;
@@ -19692,6 +19739,7 @@ cmdline_parse_ctx_t main_ctx[] = {
 	(cmdline_parse_inst_t *)&cmd_show_set_raw,
 	(cmdline_parse_inst_t *)&cmd_show_set_raw_all,
 	(cmdline_parse_inst_t *)&cmd_config_tx_dynf_specific,
+	(cmdline_parse_inst_t *)&cmd_port_pmgmt,
 	NULL,
 };
 
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (5 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port Anatoly Burakov
@ 2020-05-27 17:33 ` Jerin Jacob
  2020-05-27 20:57   ` Stephen Hemminger
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  7 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-05-27 17:33 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dpdk-dev, David Hunt, Liang Ma

On Wed, May 27, 2020 at 10:32 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers
> to cause the CPU to enter a power-optimized state while
> waiting for packets to arrive, along with a set of
> (hopefully generic) intrinsics that facilitate that. This
> is achieved through cooperation with the NIC driver that
> will allow us to know address of the next NIC RX ring
> packet descriptor, and wait for writes on it.
>
> On IA, this is achieved through using UMONITOR/UMWAIT
> instructions. They are used in their raw opcode form
> because there is no widespread compiler support for
> them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen
> to implement similar instructions.
>
> To achieve power savings, there is a very simple mechanism
> used: we're counting empty polls, and if a certain threshold
> is reached, we get the address of next RX ring descriptor
> from the NIC driver, arm the monitoring hardware, and
> enter a power-optimized state. We will then wake up when
> either a timeout happens, or a write happens (or generally
> whenever CPU feels like waking up - this is platform-
> specific), and proceed as normal. The empty poll counter is
> reset whenever we actually get packets, so we only go to
> sleep when we know nothing is going on.
>
> Why are we putting it into ethdev as opposed to leaving
> this up to the application? Our customers specifically
> requested a way to do it wit minimal changes to the
> application code. The current approach allows to just
> flip a switch and automagically have power savings.
>
> There are certain limitations in this patchset right now:
> - Currently, only 1:1 core to queue mapping is supported,
>   meaning that each lcore must at most handle RX on a
>   single queue
> - Currently, power management is enabled per-port, not
>   per-queue
> - There is potential to greatly increase TX latency if we
>   are buffering things, and go to sleep before sending
>   packets
> - The API is not perfect and could use some improvement
>   and discussion
> - The API doesn't extend to other device types
> - The intrinsics are platform-specific, so ethdev has
>   some platform-specific code in it
> - Support was only implemented for devices using
>   net/ixgbe, net/i40e and net/ice drivers
>
> Hopefully this would generate enough feedback to clear
> a path forward!

Just for my understanding:

How/Is this solution is superior than Rx queue interrupt based scheme that
applied in l3fwd-power?

What I meant by superior here, as an example,
a)Is there any power savings in mill watt vs interrupt scheme?
b) Is there improvement on time reduction between switching from/to a
different state
(i.e how fast it can move from low power state to full power state) vs
interrupt scheme.
etc

or This just for just pushing all the logic to ethdev so that
applications can be transparent?


>
> Anatoly Burakov (6):
>   eal: add power management intrinsics
>   ethdev: add simple power management API
>   net/ixgbe: implement power management API
>   net/i40e: implement power management API
>   net/ice: implement power management API
>   app/testpmd: add command for power management on a port
>
>  app/test-pmd/cmdline.c                        |  48 +++++++
>  drivers/net/i40e/i40e_ethdev.c                |   1 +
>  drivers/net/i40e/i40e_rxtx.c                  |  23 +++
>  drivers/net/i40e/i40e_rxtx.h                  |   2 +
>  drivers/net/ice/ice_ethdev.c                  |   1 +
>  drivers/net/ice/ice_rxtx.c                    |  23 +++
>  drivers/net/ice/ice_rxtx.h                    |   2 +
>  drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c                |  22 +++
>  drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
>  .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
>  lib/librte_eal/x86/rte_cpuflags.c             |   2 +
>  lib/librte_ethdev/rte_ethdev.c                |  39 +++++
>  lib/librte_ethdev/rte_ethdev.h                |  70 +++++++++
>  lib/librte_ethdev/rte_ethdev_core.h           |  41 +++++-
>  lib/librte_ethdev/rte_ethdev_version.map      |   4 +
>  20 files changed, 480 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
>
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
@ 2020-05-27 20:57   ` Stephen Hemminger
  0 siblings, 0 replies; 421+ messages in thread
From: Stephen Hemminger @ 2020-05-27 20:57 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Anatoly Burakov, dpdk-dev, David Hunt, Liang Ma

On Wed, 27 May 2020 23:03:59 +0530
Jerin Jacob <jerinjacobk@gmail.com> wrote:

> On Wed, May 27, 2020 at 10:32 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
> >
> > This patchset proposes a simple API for Ethernet drivers
> > to cause the CPU to enter a power-optimized state while
> > waiting for packets to arrive, along with a set of
> > (hopefully generic) intrinsics that facilitate that. This
> > is achieved through cooperation with the NIC driver that
> > will allow us to know address of the next NIC RX ring
> > packet descriptor, and wait for writes on it.
> >
> > On IA, this is achieved through using UMONITOR/UMWAIT
> > instructions. They are used in their raw opcode form
> > because there is no widespread compiler support for
> > them yet. Still, the API is made generic enough to
> > hopefully support other architectures, if they happen
> > to implement similar instructions.
> >
> > To achieve power savings, there is a very simple mechanism
> > used: we're counting empty polls, and if a certain threshold
> > is reached, we get the address of next RX ring descriptor
> > from the NIC driver, arm the monitoring hardware, and
> > enter a power-optimized state. We will then wake up when
> > either a timeout happens, or a write happens (or generally
> > whenever CPU feels like waking up - this is platform-
> > specific), and proceed as normal. The empty poll counter is
> > reset whenever we actually get packets, so we only go to
> > sleep when we know nothing is going on.
> >
> > Why are we putting it into ethdev as opposed to leaving
> > this up to the application? Our customers specifically
> > requested a way to do it wit minimal changes to the
> > application code. The current approach allows to just
> > flip a switch and automagically have power savings.
> >
> > There are certain limitations in this patchset right now:
> > - Currently, only 1:1 core to queue mapping is supported,
> >   meaning that each lcore must at most handle RX on a
> >   single queue
> > - Currently, power management is enabled per-port, not
> >   per-queue
> > - There is potential to greatly increase TX latency if we
> >   are buffering things, and go to sleep before sending
> >   packets
> > - The API is not perfect and could use some improvement
> >   and discussion
> > - The API doesn't extend to other device types
> > - The intrinsics are platform-specific, so ethdev has
> >   some platform-specific code in it
> > - Support was only implemented for devices using
> >   net/ixgbe, net/i40e and net/ice drivers
> >
> > Hopefully this would generate enough feedback to clear
> > a path forward!  
> 
> Just for my understanding:
> 
> How/Is this solution is superior than Rx queue interrupt based scheme that
> applied in l3fwd-power?
> 
> What I meant by superior here, as an example,
> a)Is there any power savings in mill watt vs interrupt scheme?
> b) Is there improvement on time reduction between switching from/to a
> different state
> (i.e how fast it can move from low power state to full power state) vs
> interrupt scheme.
> etc
> 
> or This just for just pushing all the logic to ethdev so that
> applications can be transparent?
> 

The interrupt scheme is going to get better power management since
the core can go to WAIT. This scheme does look interesting in theory
since it will be lower latency.

but has a number of issues:
  * changing drivers
  * can not multiplex multiple queues per core; you are assuming
    a certain threading model
  * what if thread is preempted
  * what about thread in a VM
  * platform specific: ARM and x86 have different semantics here


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
@ 2020-05-28 11:39   ` Ananyev, Konstantin
  2020-05-28 14:40     ` Burakov, Anatoly
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 11:39 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

Hi Anatoly,

> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.

Recently ARM guys introduced new generic API
for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
Probably would make sense to unite both APIs into something common
and HW transparent. 
Konstantin

> 
> Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
>  lib/librte_eal/x86/rte_cpuflags.c             |   2 +
>  6 files changed, 203 insertions(+)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> 
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..8646c4ac16
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file define APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index bc73ec2c5c..b54a2be4f6 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -59,6 +59,7 @@ generic_headers = files(
>  	'generic/rte_memcpy.h',
>  	'generic/rte_pause.h',
>  	'generic/rte_prefetch.h',
> +	'generic/rte_power_intrinsics.h',
>  	'generic/rte_rwlock.h',
>  	'generic/rte_spinlock.h',
>  	'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_prefetch.h',
> +	'rte_power_intrinsics.h',
>  	'rte_pause.h',
>  	'rte_rtm.h',
>  	'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..94d6a43763 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
>  	RTE_CPUFLAG_EM64T,                  /**< EM64T */
> 
> +	RTE_CPUFLAG_WAITPKG,                 /**< UMINITOR/UMWAIT/TPAUSE */
>  	/* (EAX 80000007h) EDX features */
>  	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
> 
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..a0522400fb
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	uint64_t rflags;
> +
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +	rte_mb();
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	uint64_t rflags;
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 30439e7951..0325c4b93b 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
>  	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>  	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
> 
> +	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
> +
>  	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>  	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
> 
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 2/6] ethdev: add simple power management API
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
@ 2020-05-28 12:15   ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 12:15 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko, Ray Kinsella,
	Neil Horman, Hunt, David, Ma, Liang J

> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 queue use case as there is no
> coordination between queues/cores in ethdev.
> 
> The TSC timestamp is automatically calculated using current link
> speed and RX descriptor ring size, such that the sleep time is
> not longer than it would take for a NIC to fill its entire RX
> descriptor ring.
> 
> Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_ethdev/rte_ethdev.c           | 39 +++++++++++++
>  lib/librte_ethdev/rte_ethdev.h           | 70 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_core.h      | 41 +++++++++++++-
>  lib/librte_ethdev/rte_ethdev_version.map |  4 ++
>  4 files changed, 152 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 8e10a6fc36..0be5ecfc11 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -16,6 +16,7 @@
>  #include <netinet/in.h>
> 
>  #include <rte_byteorder.h>
> +#include <rte_cpuflags.h>
>  #include <rte_log.h>
>  #include <rte_debug.h>
>  #include <rte_interrupts.h>
> @@ -5053,6 +5054,44 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
>  	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
>  }
> 
> +int
> +rte_eth_dev_power_mgmt_enable(uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> +		return -ENOTSUP;
> +
> +	/* allocate memory for empty poll stats */
> +	dev->empty_poll_stats = rte_malloc_socket(NULL,
> +		sizeof(struct rte_eth_ep_stat) * RTE_MAX_QUEUES_PER_PORT,
> +		0, dev->data->numa_node);
> +
> +	if (dev->empty_poll_stats == NULL)
> +		return -ENOMEM;
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> +	return 0;
> +}
> +
> +int
> +rte_eth_dev_power_mgmt_disable(uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/* rte_free ignores NULL so safe to call without checks */
> +	rte_free(dev->empty_poll_stats);
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> +	return 0;
> +}
> +
>  /**
>   * A set of values to describe the possible states of a switch domain.
>   */
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index a49242bcd2..b8318f7e91 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -157,6 +157,7 @@ extern "C" {
>  #include <rte_common.h>
>  #include <rte_config.h>
>  #include <rte_ether.h>
> +#include <rte_power_intrinsics.h>
> 
>  #include "rte_ethdev_trace_fp.h"
>  #include "rte_dev_info.h"
> @@ -666,6 +667,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
>  /** Maximum nb. of vlan per mirror rule */
>  #define ETH_MIRROR_MAX_VLANS       64
> 
> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
>  #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
>  #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
>  #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
> @@ -1490,6 +1492,16 @@ enum rte_eth_dev_state {
>  	RTE_ETH_DEV_REMOVED,
>  };
> 
> +/**
> + * Possible power managment states of an ethdev port.
> + */
> +enum rte_eth_dev_power_mgmt_state {
> +	/** Device power management is disabled. */
> +	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	RTE_ETH_DEV_POWER_MGMT_ENABLED
> +};
> +
>  struct rte_eth_dev_sriov {
>  	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
>  	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
> @@ -4302,6 +4314,38 @@ __rte_experimental
>  int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
>  				       struct rte_eth_hairpin_cap *cap);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable device power management.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_eth_dev_power_mgmt_enable(uint16_t port_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Disable device power management.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_eth_dev_power_mgmt_disable(uint16_t port_id);
> +
>  #include <rte_ethdev_core.h>
> 
>  /**
> @@ -4417,6 +4461,32 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>  		} while (cb != NULL);
>  	}
>  #endif
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[queue_id].num++;
> +			if (unlikely(dev->empty_poll_stats[queue_id].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +				volatile void *target_addr;
> +				uint64_t expected, mask;
> +				int ret;
> +
> +				/*
> +				 * get address of next descriptor in the RX
> +				 * ring for this queue, as well as expected
> +				 * value and a mask.
> +				 */
> +				ret = (*dev->dev_ops->next_rx_desc)
> +					(dev->data->rx_queues[queue_id],
> +					 &target_addr, &expected, &mask);

That makes every PMD that doesn't support next_rx_desc op to crash.
One simple way to avoid it - check in rte_eth_dev_power_mgmt_enable() that PMD
does implement ops->next_rx_desc.
Though I don't think introducing such new op is a best approach, as it implies
that PMD does have HW RX descriptor mapped into WB-type memory, and dictates 
to PMD on what it should sleep on.
Though depending on HW/SW capabilities and implementation PMD might choose to
sleep on different thing (HW doorbell, SW cond var, etc.).
Another thing - I doubt it is a good idea to pollute generic RX function with power
specific code (again, as I said above it probably wouldn't be that generic for all possible PMDs).
From my perspective we have 2 alternatives to implement such functionality:
1. Keep rte_eth_dev_power_mgmt_enable/disable(port, queue) and move actual 
    *wait_on* code into the PMD RX implementations (we probably can still have some common.      
    logic about allowed number of empty polls, max timeout to sleep, etc.).
2. Drop rte_eth_dev_power_mgmt_enable/disable and introduce explicit:
    rte_eth_dev_wait_for_packet(port, queue, timeout)  API function.
    
In both cases PMD will have a full freedom to implement *wait_on_packet* functionality 
in a most convenient way.
For 2) user would have to do some extra work himself
(count number of consecutive empty polls, call *wait_on_packet* function explicitly).
Though I think it can be easily hidden inside some wrapper API on top
of rte_eth_rx_burst()/rte_eth-dev_wait_for_packet().
Something like rte_eth_rx_burst_wait() or so.
We can have logic about allowed number of empty polls,
might be some other conditions in that top level function.
In that case changes in the user app will still be minimal. 
From other side 2) gives user explicit control on where and when to sleep,
so from my perspective it seems more straightforward and flexible.

> +				if (ret == 0)
> +					/* -1ULL is maximum value for TSC */
> +					rte_power_monitor(target_addr,
> +							  expected, mask,
> +							  0, -1ULL);
> +			}
> +		} else
> +			dev->empty_poll_stats[queue_id].num = 0;
> +	}
> 
>  	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
>  	return nb_rx;
> diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
> index 32407dd418..4e23d465f0 100644
> --- a/lib/librte_ethdev/rte_ethdev_core.h
> +++ b/lib/librte_ethdev/rte_ethdev_core.h
> @@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the next RX ring descriptor address.
> + *
> + * @param rxq
> + *   ethdev queue pointer.
> + * @param tail_desc_addr
> + *   the pointer point to descriptor address var.
> + *
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */
> +typedef int (*eth_next_rx_desc_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -752,6 +773,8 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_next_rx_desc_t next_rx_desc;
> +	/**< Get next RX ring descriptor address. */
>  };
> 
>  /**
> @@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
>  	void *param;
>  };
> 
> +/**
> + * @internal
> + * Structure used to hold counters for empty poll
> + */
> +struct rte_eth_ep_stat {
> +	uint64_t num;
> +} __rte_cache_aligned;
> +
>  /**
>   * @internal
>   * The generic data structure associated with each ethernet device.
> @@ -807,8 +838,14 @@ struct rte_eth_dev {
>  	enum rte_eth_dev_state state; /**< Flag indicating the port state */
>  	void *security_ctx; /**< Context for security ops */
> 
> -	uint64_t reserved_64s[4]; /**< Reserved for future fields */
> -	void *reserved_ptrs[4];   /**< Reserved for future fields */
> +	/**< Empty poll number */
> +	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
> +	uint32_t reserved_32;
> +	uint64_t reserved_64s[3]; /**< Reserved for future fields */
> +
> +	/**< Flag indicating the port power state */
> +	struct rte_eth_ep_stat *empty_poll_stats;
> +	void *reserved_ptrs[3];   /**< Reserved for future fields */
>  } __rte_cache_aligned;
> 
>  struct rte_eth_dev_sriov;
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index 7155056045..141361823d 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -241,4 +241,8 @@ EXPERIMENTAL {
>  	__rte_ethdev_trace_rx_burst;
>  	__rte_ethdev_trace_tx_burst;
>  	rte_flow_get_aged_flows;
> +
> +	# added in 20.08
> +	rte_eth_dev_power_mgmt_disable;
> +	rte_eth_dev_power_mgmt_enable;
>  };
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 11:39   ` Ananyev, Konstantin
@ 2020-05-28 14:40     ` Burakov, Anatoly
  2020-05-28 14:58       ` Bruce Richardson
  2020-05-28 15:38       ` Ananyev, Konstantin
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 14:40 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> Hi Anatoly,
> 
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
> 
> Recently ARM guys introduced new generic API
> for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> Probably would make sense to unite both APIs into something common
> and HW transparent.
> Konstantin

Hi Konstantin,

That's not really similar purpose. This is monitoring a cacheline for 
writes, not waiting on a specific value. The "expected" value is there 
as basically a hack to get around the race condition due to the fact 
that by the time you enter monitoring state, the write you're waiting 
for may have already happened.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 14:40     ` Burakov, Anatoly
@ 2020-05-28 14:58       ` Bruce Richardson
  2020-05-28 15:38       ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: Bruce Richardson @ 2020-05-28 14:58 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Ananyev, Konstantin, dev, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On Thu, May 28, 2020 at 03:40:18PM +0100, Burakov, Anatoly wrote:
> On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> > Hi Anatoly,
> > 
> > > 
> > > Add two new power management intrinsics, and provide an implementation
> > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > are implemented as raw byte opcodes because there is not yet widespread
> > > compiler support for these instructions.
> > > 
> > > The power management instructions provide an architecture-specific
> > > function to either wait until a specified TSC timestamp is reached, or
> > > optionally wait until either a TSC timestamp is reached or a memory
> > > location is written to. The monitor function also provides an optional
> > > comparison, to avoid sleeping when the expected write has already
> > > happened, and no more writes are expected.
> > 
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
> 
> Hi Konstantin,
> 
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. The "expected" value is there as
> basically a hack to get around the race condition due to the fact that by
> the time you enter monitoring state, the write you're waiting for may have
> already happened.
> 
Rather than the "expected" value, is it not more useful for a general API
to check for the existing value? Since we are awaiting writes, we may not
know what new value will be written, but we do know what the value is now,
and can just check that it's not been changed.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 14:40     ` Burakov, Anatoly
  2020-05-28 14:58       ` Bruce Richardson
@ 2020-05-28 15:38       ` Ananyev, Konstantin
  2020-05-29  6:56         ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 15:38 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli


> > Hi Anatoly,
> >
> >>
> >> Add two new power management intrinsics, and provide an implementation
> >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >> are implemented as raw byte opcodes because there is not yet widespread
> >> compiler support for these instructions.
> >>
> >> The power management instructions provide an architecture-specific
> >> function to either wait until a specified TSC timestamp is reached, or
> >> optionally wait until either a TSC timestamp is reached or a memory
> >> location is written to. The monitor function also provides an optional
> >> comparison, to avoid sleeping when the expected write has already
> >> happened, and no more writes are expected.
> >
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
> 
> Hi Konstantin,
> 
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. 

I understand that.

> The "expected" value is there
> as basically a hack to get around the race condition due to the fact
> that by the time you enter monitoring state, the write you're waiting
> for may have already happened.

AFAIK, current rte_wait_until_equal_* does pretty much the same thing:

LDXR memaddr, $reg  // an address to monitor for
if ($reg != expected_value)
   SEVL      //     arm monitor
   do {
       WFE     //      waits for write to that memory address  
       LDXR memaddr, $reg
   } while ($reg != expected_value);   
 
Looks pretty similar to what rte_power_monitor() does,
except you don't have a loop for checking the new value.
Plus rte_power_monitor() provides extra options to the user - 
timestamp and power save mode to enter.
Also I don't know what is the granularity of such events on ARM,
is it a cache-line or more/less.
Might be ARM people can comment/correct me here. 
Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 15:38       ` Ananyev, Konstantin
@ 2020-05-29  6:56         ` Jerin Jacob
  2020-06-02 10:15           ` Ananyev, Konstantin
  2020-06-03  6:22           ` Honnappa Nagarahalli
  0 siblings, 2 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-05-29  6:56 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, Honnappa.Nagarahalli

On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
> > > Hi Anatoly,
> > >
> > >>
> > >> Add two new power management intrinsics, and provide an implementation
> > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > >> are implemented as raw byte opcodes because there is not yet widespread
> > >> compiler support for these instructions.
> > >>
> > >> The power management instructions provide an architecture-specific
> > >> function to either wait until a specified TSC timestamp is reached, or
> > >> optionally wait until either a TSC timestamp is reached or a memory
> > >> location is written to. The monitor function also provides an optional
> > >> comparison, to avoid sleeping when the expected write has already
> > >> happened, and no more writes are expected.
> > >
> > > Recently ARM guys introduced new generic API
> > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > Probably would make sense to unite both APIs into something common
> > > and HW transparent.
> > > Konstantin
> >
> > Hi Konstantin,
> >
> > That's not really similar purpose. This is monitoring a cacheline for
> > writes, not waiting on a specific value.
>
> I understand that.
>
> > The "expected" value is there
> > as basically a hack to get around the race condition due to the fact
> > that by the time you enter monitoring state, the write you're waiting
> > for may have already happened.
>
> AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
>
> LDXR memaddr, $reg  // an address to monitor for
> if ($reg != expected_value)
>    SEVL      //     arm monitor
>    do {
>        WFE     //      waits for write to that memory address
>        LDXR memaddr, $reg
>    } while ($reg != expected_value);
>
> Looks pretty similar to what rte_power_monitor() does,
> except you don't have a loop for checking the new value.
> Plus rte_power_monitor() provides extra options to the user -
> timestamp and power save mode to enter.
> Also I don't know what is the granularity of such events on ARM,
> is it a cache-line or more/less.

As I understand it, Granularity is per the cache-line.
ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power
state until the cache line is written.

But I see UMONITOR bit different, Where _without_ other core signaling
to wakeup from wait state,
it can wake on TSC expiry. I think, that's is the main primitive on
this feature. Right?

WFE can also wake based on Timer stream events(kind of TSC in x86
analogy) but it has a configuration
bit that needs to allow for this scheme in userspace(EL0) or not?
defined by EL1(Linux kernel).
I am planning to spend time on this after understanding the value
addition of the feature/usecase[1]
[1]
http://mails.dpdk.org/archives/dev/2020-May/168888.html





> Might be ARM people can comment/correct me here.
> Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-29  6:56         ` Jerin Jacob
@ 2020-06-02 10:15           ` Ananyev, Konstantin
  2020-06-03  6:22           ` Honnappa Nagarahalli
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-06-02 10:15 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, Honnappa.Nagarahalli


> 
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an implementation
> > > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > >> are implemented as raw byte opcodes because there is not yet widespread
> > > >> compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an architecture-specific
> > > >> function to either wait until a specified TSC timestamp is reached, or
> > > >> optionally wait until either a TSC timestamp is reached or a memory
> > > >> location is written to. The monitor function also provides an optional
> > > >> comparison, to avoid sleeping when the expected write has already
> > > >> happened, and no more writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API
> > > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline for
> > > writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're waiting
> > > for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg  // an address to monitor for
> > if ($reg != expected_value)
> >    SEVL      //     arm monitor
> >    do {
> >        WFE     //      waits for write to that memory address
> >        LDXR memaddr, $reg
> >    } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does,
> > except you don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM,
> > is it a cache-line or more/less.
> 
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power
> state until the cache line is written.
> 
> But I see UMONITOR bit different, Where _without_ other core signaling
> to wakeup from wait state,
> it can wake on TSC expiry. I think, that's is the main primitive on
> this feature. Right?
> 
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
> I am planning to spend time on this after understanding the value
> addition of the feature/usecase[1]
> [1]
> http://mails.dpdk.org/archives/dev/2020-May/168888.html
> 

Ok, if there is a consensus to keep these two APIs disjoint for now -
I wouldn't insist.
Konstantin


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-29  6:56         ` Jerin Jacob
  2020-06-02 10:15           ` Ananyev, Konstantin
@ 2020-06-03  6:22           ` Honnappa Nagarahalli
  2020-06-03  6:31             ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Honnappa Nagarahalli @ 2020-06-03  6:22 UTC (permalink / raw)
  To: Jerin Jacob, Ananyev, Konstantin
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, nd, nd

<snip>

> 
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an
> > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > >> The instructions are implemented as raw byte opcodes because
> > > >> there is not yet widespread compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an
> > > >> architecture-specific function to either wait until a specified
> > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > >> timestamp is reached or a memory location is written to. The
> > > >> monitor function also provides an optional comparison, to avoid
> > > >> sleeping when the expected write has already happened, and no more
> writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API for similar (as I
> > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline
> > > for writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're
> > > waiting for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg  // an address to monitor for if ($reg !=
> > expected_value)
> >    SEVL      //     arm monitor
> >    do {
> >        WFE     //      waits for write to that memory address
> >        LDXR memaddr, $reg
> >    } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does, except you
> > don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM, is it
> > a cache-line or more/less.
> 
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> the cache line is written.
Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.

> 
> But I see UMONITOR bit different, Where _without_ other core signaling to
> wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> primitive on this feature. Right?
> 
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
Timer stream events are not per CPU core. They are system wide streams.

> I am planning to spend time on this after understanding the value addition of
> the feature/usecase[1] [1] http://mails.dpdk.org/archives/dev/2020-
> May/168888.html
> 
> 
> 
> 
> 
> > Might be ARM people can comment/correct me here.
> > Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-06-03  6:22           ` Honnappa Nagarahalli
@ 2020-06-03  6:31             ` Jerin Jacob
  0 siblings, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-06-03  6:31 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Ananyev, Konstantin, Burakov, Anatoly, dev, Richardson, Bruce,
	Hunt, David, Ma, Liang J, nd

On Wed, Jun 3, 2020 at 11:53 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> >
> > On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > >
> > >
> > > > > Hi Anatoly,
> > > > >
> > > > >>
> > > > >> Add two new power management intrinsics, and provide an
> > > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > > >> The instructions are implemented as raw byte opcodes because
> > > > >> there is not yet widespread compiler support for these instructions.
> > > > >>
> > > > >> The power management instructions provide an
> > > > >> architecture-specific function to either wait until a specified
> > > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > > >> timestamp is reached or a memory location is written to. The
> > > > >> monitor function also provides an optional comparison, to avoid
> > > > >> sleeping when the expected write has already happened, and no more
> > writes are expected.
> > > > >
> > > > > Recently ARM guys introduced new generic API for similar (as I
> > > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > > Probably would make sense to unite both APIs into something common
> > > > > and HW transparent.
> > > > > Konstantin
> > > >
> > > > Hi Konstantin,
> > > >
> > > > That's not really similar purpose. This is monitoring a cacheline
> > > > for writes, not waiting on a specific value.
> > >
> > > I understand that.
> > >
> > > > The "expected" value is there
> > > > as basically a hack to get around the race condition due to the fact
> > > > that by the time you enter monitoring state, the write you're
> > > > waiting for may have already happened.
> > >
> > > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> > >
> > > LDXR memaddr, $reg  // an address to monitor for if ($reg !=
> > > expected_value)
> > >    SEVL      //     arm monitor
> > >    do {
> > >        WFE     //      waits for write to that memory address
> > >        LDXR memaddr, $reg
> > >    } while ($reg != expected_value);
> > >
> > > Looks pretty similar to what rte_power_monitor() does, except you
> > > don't have a loop for checking the new value.
> > > Plus rte_power_monitor() provides extra options to the user -
> > > timestamp and power save mode to enter.
> > > Also I don't know what is the granularity of such events on ARM, is it
> > > a cache-line or more/less.
> >
> > As I understand it, Granularity is per the cache-line.
> > ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> > the cache line is written.
> Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.
>
> >
> > But I see UMONITOR bit different, Where _without_ other core signaling to
> > wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> > primitive on this feature. Right?
> >
> > WFE can also wake based on Timer stream events(kind of TSC in x86
> > analogy) but it has a configuration
> > bit that needs to allow for this scheme in userspace(EL0) or not?
> > defined by EL1(Linux kernel).
> Timer stream events are not per CPU core. They are system wide streams.

We may not need per core support to implement this use case.

I think, currently, kernel configured to have a WFE signal on every
100us.(System-wide).

do while{} loop can check if it is passing the requested timestamp after WFE.
But minimum granularity will be 100us.(i.e 100us worth of ticks)


>
> > I am planning to spend time on this after understanding the value addition of
> > the feature/usecase[1] [1] http://mails.dpdk.org/archives/dev/2020-
> > May/168888.html
> >
> >
> >
> >
> >
> > > Might be ARM people can comment/correct me here.
> > > Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (6 preceding siblings ...)
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
@ 2020-08-11 10:27 ` Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
                     ` (6 more replies)
  7 siblings, 7 replies; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 138 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 207 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 000000000..8646c4ac1
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file define APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd0902795..3a12e87e1 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2f..494a8142a 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d..94d6a4376 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
 	RTE_CPUFLAG_EM64T,                  /**< EM64T */
 
+	RTE_CPUFLAG_WAITPKG,                 /**< UMINITOR/UMWAIT/TPAUSE */
 	/* (EAX 80000007h) EDX features */
 	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
 
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 000000000..af8aa9459
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+	rte_mb();
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e795..0325c4b93 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-13 18:11     ` Liang, Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API Liang Ma
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will either wait
until a TSC timestamp expires, or when packets arrive.

This API is limited to 1 core 1 queue use case as there is no
coordination between queues/cores in ethdev.

This design leverage RX Callback mechnaism which allow three
different power management methodology co exist.

1. umwait/umonitor:

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. Pause instruction

   Instead of move the core into deeper C state, this lightweight
   method use Pause instruction to releaf the processor from
   busy polling.

3. Frequency Scaling
   Reuse exist rte power library to scale up/down core frequency
   depend on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 config/common_base                       |   4 +-
 lib/Makefile                             |   1 +
 lib/librte_ethdev/Makefile               |   2 +-
 lib/librte_ethdev/meson.build            |   2 +-
 lib/librte_ethdev/rte_ethdev.c           | 198 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           |  59 +++++++
 lib/librte_ethdev/rte_ethdev_core.h      |  43 ++++-
 lib/librte_ethdev/rte_ethdev_version.map |   4 +
 lib/meson.build                          |   5 +-
 mk/rte.app.mk                            |   2 +-
 10 files changed, 311 insertions(+), 9 deletions(-)

diff --git a/config/common_base b/config/common_base
index f76585f16..e0948f0cb 100644
--- a/config/common_base
+++ b/config/common_base
@@ -155,7 +155,7 @@ CONFIG_RTE_MAX_ETHPORTS=32
 CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
 CONFIG_RTE_LIBRTE_IEEE1588=n
 CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
-CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
+CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=n
 CONFIG_RTE_ETHDEV_PROFILE_WITH_VTUNE=n
 
 #
@@ -978,7 +978,7 @@ CONFIG_RTE_LIBRTE_ACL_DEBUG=n
 #
 # Compile librte_power
 #
-CONFIG_RTE_LIBRTE_POWER=n
+CONFIG_RTE_LIBRTE_POWER=y
 CONFIG_RTE_LIBRTE_POWER_DEBUG=n
 CONFIG_RTE_MAX_LCORE_FREQS=64
 
diff --git a/lib/Makefile b/lib/Makefile
index 8f5b68a2d..87646698a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,6 +28,7 @@ DEPDIRS-librte_ethdev := librte_net librte_eal librte_mempool librte_ring
 DEPDIRS-librte_ethdev += librte_mbuf
 DEPDIRS-librte_ethdev += librte_kvargs
 DEPDIRS-librte_ethdev += librte_meter
+DEPDIRS-librte_ethdev += librte_power
 DIRS-$(CONFIG_RTE_LIBRTE_BBDEV) += librte_bbdev
 DEPDIRS-librte_bbdev := librte_eal librte_mempool librte_mbuf
 DIRS-$(CONFIG_RTE_LIBRTE_CRYPTODEV) += librte_cryptodev
diff --git a/lib/librte_ethdev/Makefile b/lib/librte_ethdev/Makefile
index 47747150b..6a4ce14cf 100644
--- a/lib/librte_ethdev/Makefile
+++ b/lib/librte_ethdev/Makefile
@@ -11,7 +11,7 @@ LIB = librte_ethdev.a
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_net -lrte_eal -lrte_mempool -lrte_ring
-LDLIBS += -lrte_mbuf -lrte_kvargs -lrte_meter -lrte_telemetry
+LDLIBS += -lrte_mbuf -lrte_kvargs -lrte_meter -lrte_telemetry -lrte_power
 
 EXPORT_MAP := rte_ethdev_version.map
 
diff --git a/lib/librte_ethdev/meson.build b/lib/librte_ethdev/meson.build
index 8fc24e8c8..e09e2395e 100644
--- a/lib/librte_ethdev/meson.build
+++ b/lib/librte_ethdev/meson.build
@@ -27,4 +27,4 @@ headers = files('rte_ethdev.h',
 	'rte_tm.h',
 	'rte_tm_driver.h')
 
-deps += ['net', 'kvargs', 'meter', 'telemetry']
+deps += ['net', 'kvargs', 'meter', 'telemetry', 'power']
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 7858ad5f1..b43de88ce 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -16,6 +16,7 @@
 #include <netinet/in.h>
 
 #include <rte_byteorder.h>
+#include <rte_cpuflags.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_interrupts.h>
@@ -39,6 +40,7 @@
 #include <rte_class.h>
 #include <rte_ether.h>
 #include <rte_telemetry.h>
+#include <rte_power.h>
 
 #include "rte_ethdev_trace.h"
 #include "rte_ethdev.h"
@@ -185,6 +187,100 @@ enum {
 	STAT_QMAP_RX
 };
 
+
+static uint16_t
+rte_ethdev_pmgmt_umait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[qidx].num++;
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+				volatile void *target_addr;
+				uint64_t expected, mask;
+				uint16_t ret;
+
+				/*
+				 * get address of next descriptor in the RX
+				 * ring for this queue, as well as expected
+				 * value and a mask.
+				 */
+				ret = (*dev->dev_ops->next_rx_desc)
+					(dev->data->rx_queues[qidx],
+					 &target_addr, &expected, &mask);
+				if (ret == 0)
+					/* -1ULL is maximum value for TSC */
+					rte_power_monitor(target_addr,
+							  expected, mask,
+							  0, -1ULL);
+			}
+		} else
+			dev->empty_poll_stats[qidx].num = 0;
+	}
+
+	return 0;
+}
+
+static uint16_t
+rte_ethdev_pmgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+
+			dev->empty_poll_stats[qidx].num++;
+
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+
+				for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+					rte_pause();
+
+			}
+		} else
+			dev->empty_poll_stats[qidx].num = 0;
+	}
+
+	return 0;
+}
+
+static uint16_t
+rte_ethdev_pmgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[qidx].num++;
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+
+				/*scale down freq */
+				rte_power_freq_min(rte_lcore_id());
+
+			}
+		} else {
+			dev->empty_poll_stats[qidx].num = 0;
+			/* scal up freq */
+			rte_power_freq_max(rte_lcore_id());
+		}
+	}
+
+	return 0;
+}
+
 int
 rte_eth_iterator_init(struct rte_dev_iterator *iter, const char *devargs_str)
 {
@@ -5113,6 +5209,108 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
 	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
 }
 
+int
+rte_eth_dev_power_mgmt_enable(unsigned int lcore_id,
+			      uint16_t port_id,
+			 enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* allocate memory for empty poll stats */
+	dev->empty_poll_stats = rte_malloc_socket(NULL,
+						  sizeof(struct rte_eth_ep_stat)
+						  * RTE_MAX_QUEUES_PER_PORT,
+						  0, dev->data->numa_node);
+
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+
+	dev->cb_mode = mode;
+
+	switch (mode) {
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT:
+
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_ethdev_pmgmt_umait, NULL);
+		break;
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_ethdev_pmgmt_scalefreq, NULL);
+		break;
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_ethdev_pmgmt_pause, NULL);
+		break;
+
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_eth_dev_power_mgmt_disable(unsigned int lcore_id,
+			       uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/*add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)  {
+		/* rte_free ignores NULL so safe to call without checks */
+		rte_free(dev->empty_poll_stats);
+
+		switch (dev->cb_mode) {
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT:
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+
+			rte_eth_remove_rx_callback(port_id, 0,
+						   dev->cur_pwr_cb);
+
+			break;
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+
+			rte_power_freq_max(lcore_id);
+
+			rte_eth_remove_rx_callback(port_id, 0,
+						   dev->cur_pwr_cb);
+
+			if (rte_power_exit(lcore_id))
+				return -EINVAL;
+
+			break;
+		}
+
+		dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+
+	}
+	return 0;
+}
+
 /**
  * A set of values to describe the possible states of a switch domain.
  */
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 57e4a6ca5..6858c0338 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1603,6 +1605,25 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
@@ -4415,6 +4436,40 @@ __rte_experimental
 int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
 				       struct rte_eth_hairpin_cap *cap);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 #include <rte_ethdev_core.h>
 
 /**
@@ -4535,6 +4590,7 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 	return nb_rx;
 }
 
+
 /**
  * Get the number of used descriptors of a rx queue
  *
@@ -4993,6 +5049,9 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
 	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
 }
 
+
+
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd41..7d6d85ddc 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   the pointer point to descriptor address var.
+ *
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +773,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +838,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/**< Empty poll number */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint32_t reserved_32;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/**< Flag indicating the port power state */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[3];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 1212a17d3..4d5b63a5b 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -241,6 +241,10 @@ EXPERIMENTAL {
 	__rte_ethdev_trace_rx_burst;
 	__rte_ethdev_trace_tx_burst;
 	rte_flow_get_aged_flows;
+
+	# added in 20.08
+	rte_eth_dev_power_mgmt_disable;
+	rte_eth_dev_power_mgmt_enable;
 };
 
 INTERNAL {
diff --git a/lib/meson.build b/lib/meson.build
index 3852c0156..54cc0db7d 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,17 +14,18 @@ libraries = [
 	'eal', # everything depends on eal
 	'ring',
 	'rcu', # rcu depends on ring
+	'timer',   # eventdev depends on this
+	'power',   # eventdev depends on this
 	'mempool', 'mbuf', 'net', 'meter', 'ethdev', 'pci', # core
 	'cmdline',
 	'metrics', # bitrate/latency stats depends on this
 	'hash',    # efd depends on this
-	'timer',   # eventdev depends on this
 	'acl', 'bbdev', 'bitratestats', 'cfgfile',
 	'compressdev', 'cryptodev',
 	'distributor', 'efd', 'eventdev',
 	'gro', 'gso', 'ip_frag', 'jobstats',
 	'kni', 'latencystats', 'lpm', 'member',
-	'power', 'pdump', 'rawdev', 'regexdev',
+	'pdump', 'rawdev', 'regexdev',
 	'rib', 'reorder', 'sched', 'security', 'stack', 'vhost',
 	# ipsec lib depends on net, crypto and security
 	'ipsec',
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index a54425997..b87abb26e 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -58,7 +58,6 @@ endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METRICS)        += --no-whole-archive
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BITRATE)        += -lrte_bitratestats
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LATENCY_STATS)  += -lrte_latencystats
-_LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)          += -lrte_power
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_EFD)            += -lrte_efd
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BPF)            += -lrte_bpf
@@ -80,6 +79,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_KVARGS)         += -lrte_kvargs
 _LDLIBS-y                                   += -lrte_telemetry
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MBUF)           += -lrte_mbuf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_NET)            += -lrte_net
+_LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)          += -lrte_power
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ETHER)          += -lrte_ethdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BBDEV)          += -lrte_bbdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CRYPTODEV)      += -lrte_cryptodev
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 4/5] net/i40e: " Liang Ma
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e..618fc1573 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf513..d1d015dea 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b2..826f451be 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 4/5] net/i40e: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 5/5] net/ice: " Liang Ma
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 05d5f2861..f0797c3cb 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -515,6 +515,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c..9d7eea8ae 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160..bfda5b6ad 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *value);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC v2 5/5] net/ice: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (2 preceding siblings ...)
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 4/5] net/i40e: " Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-13 18:04   ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang, Ma
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 7dd3fcd27..7a636cd11 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index cc3139042..ce7e025b6 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d0..7eb6fa904 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (3 preceding siblings ...)
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 5/5] net/ice: " Liang Ma
@ 2020-08-13 18:04   ` Liang, Ma
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  6 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-08-13 18:04 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

On 11 Aug 11:27, Liang Ma wrote:
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> --- 
<snip> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d..94d6a4376 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
>  	RTE_CPUFLAG_EM64T,                  /**< EM64T */
>  
> +	RTE_CPUFLAG_WAITPKG,                 /**< UMINITOR/UMWAIT/TPAUSE */
need re-work the order to avoid breaking ABI
>  	/* (EAX 80000007h) EDX features */
>  	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
>  
</snip>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
@ 2020-08-13 18:11     ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-08-13 18:11 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

On 11 Aug 11:27, Liang Ma wrote:
<snip>
> +static uint16_t
> +rte_ethdev_pmgmt_umait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[qidx].num++;
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +				volatile void *target_addr;
> +				uint64_t expected, mask;
> +				uint16_t ret;
> +
> +				/*
> +				 * get address of next descriptor in the RX
> +				 * ring for this queue, as well as expected
> +				 * value and a mask.
> +				 */
> +				ret = (*dev->dev_ops->next_rx_desc)
> +					(dev->data->rx_queues[qidx],
> +					 &target_addr, &expected, &mask);
> +				if (ret == 0)
> +					/* -1ULL is maximum value for TSC */
> +					rte_power_monitor(target_addr,
> +							  expected, mask,
> +							  0, -1ULL);
> +			}
> +		} else
> +			dev->empty_poll_stats[qidx].num = 0;
> +	}
> +
> +	return 0;
should return nb_rx here. that's fixed in v3.
> +}
> +
> +static uint16_t
> +rte_ethdev_pmgmt_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	int i;
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +
> +			dev->empty_poll_stats[qidx].num++;
> +
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +
> +				for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> +					rte_pause();
> +
> +			}
> +		} else
> +			dev->empty_poll_stats[qidx].num = 0;
> +	}
> +
> +	return 0;
should return  nb_rx here. that's fixed in v3.
> +}
> +
> +static uint16_t
> +rte_ethdev_pmgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[qidx].num++;
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +
> +				/*scale down freq */
> +				rte_power_freq_min(rte_lcore_id());
> +
> +			}
> +		} else {
> +			dev->empty_poll_stats[qidx].num = 0;
> +			/* scal up freq */
> +			rte_power_freq_max(rte_lcore_id());
> +		}
> +	}
> +
> +	return 0;
should return  nb_rx here. that's fixed in v3.
> +}
> +
</snip>

 -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 1/6] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (4 preceding siblings ...)
  2020-08-13 18:04   ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang, Ma
@ 2020-09-03 16:06   ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
                       ` (4 more replies)
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  6 siblings, 5 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:06 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   2 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 213 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file define APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/**< UMWAIT/TPAUSE Instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* the rflags need match native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* the rflags need match native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback Liang Ma
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple API allow ethdev get the last
available queue descriptor address from PMD.
Also include internal structure update.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
 2 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 70295d7ab7..d9312d3e11 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** WAIT callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..16e54bb4e4 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   the pointer point to descriptor address var.
+ * @param expected
+ *   the pointer point to value to be expected when descriptor is set.
+ * @param mask
+ *   the pointer point to comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +776,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +841,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/**< Empty poll number */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	/**< Power mgmt Callback mode */
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/**< Flag indicating the port power state */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[2];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will either wait
until a TSC timestamp expires, or when packets arrive.

This API is limited to 1 core 1 port 1 queue use case as there is
no coordination between queues/cores in ethdev. 1 port map to multiple
core will be supported in next version.

This design leverage RX Callback mechnaism which allow three
different power management methodology co exist.

1. umwait/umonitor:

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. Pause instruction

   Instead of move the core into deeper C state, this lightweight
   method use Pause instruction to releaf the processor from
   busy polling.

3. Frequency Scaling
   Reuse exist rte power library to scale up/down core frequency
   depend on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_power/meson.build           |   3 +-
 lib/librte_power/rte_power.h           |  38 +++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
 lib/librte_power/rte_power_version.map |   4 +
 4 files changed, 228 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..44b01afce2 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
 headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
index bbbde4dfb4..06d5a9984f 100644
--- a/lib/librte_power/rte_power.h
+++ b/lib/librte_power/rte_power.h
@@ -14,6 +14,7 @@
 #include <rte_byteorder.h>
 #include <rte_log.h>
 #include <rte_string_fns.h>
+#include <rte_ethdev.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
  */
 int rte_power_exit(unsigned int lcore_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function mode.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 /**
  * Get the available frequencies of a specific lcore.
  * Function pointer definition. Review each environments
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..a445153ede
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+
+#include "rte_power.h"
+
+
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			uint16_t ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = (*dev->dev_ops->next_rx_desc)
+				(dev->data->rx_queues[qidx],
+				 &target_addr, &expected, &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr,
+						  expected, mask,
+						  0, -1ULL);
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (unlikely(nb_rx == 0)) {
+
+		dev->empty_poll_stats[qidx].num++;
+
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+				rte_pause();
+
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			/*scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		dev->empty_poll_stats[qidx].num = 0;
+		/* scal up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+			uint16_t port_id,
+			enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+	/* allocate memory for empty poll stats */
+	dev->empty_poll_stats = rte_malloc_socket(NULL,
+						  sizeof(struct rte_eth_ep_stat)
+						  * RTE_MAX_QUEUES_PER_PORT,
+						  0, dev->data->numa_node);
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	switch (mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+
+	dev->cb_mode = mode;
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_power_pmd_mgmt_disable(unsigned int lcore_id,
+				uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/*add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
+		return -EINVAL;
+
+	/* rte_free ignores NULL so safe to call without checks */
+	rte_free(dev->empty_poll_stats);
+
+	switch (dev->cb_mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		if (rte_power_exit(lcore_id))
+			return -EINVAL;
+		break;
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	dev->cur_pwr_cb = NULL;
+	dev->cb_mode = 0;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 00ee5753e2..ade83cfd4f 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.08
+	rte_power_pmd_mgmt_disable;
+	rte_power_pmd_mgmt_enable;
+
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 6/6] net/ice: " Liang Ma
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e2..618fc15732 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..d1d015deae 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..826f451bee 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
                       ` (2 preceding siblings ...)
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 6/6] net/ice: " Liang Ma
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 11c02b1888..94e9298d7c 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -517,6 +517,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c1..9d7eea8aed 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..bfda5b6ad3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *value);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [RFC PATCH v3 6/6] net/ice: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
                       ` (3 preceding siblings ...)
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: " Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  4 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 8d435e8892..7d7e1dcbac 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 2e1f06d2c0..c043181ceb 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (5 preceding siblings ...)
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
@ 2020-09-04 10:18   ` Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
                       ` (10 more replies)
  6 siblings, 11 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   2 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 213 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file define APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/**< UMWAIT/TPAUSE Instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* the rflags need match native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* the rflags need match native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 16:37       ` Stephen Hemminger
  2020-09-04 20:54       ` Ananyev, Konstantin
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
                       ` (9 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple API allow ethdev get the last
available queue descriptor address from PMD.
Also include internal structure update.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
 2 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 70295d7ab7..d9312d3e11 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** WAIT callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..16e54bb4e4 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   the pointer point to descriptor address var.
+ * @param expected
+ *   the pointer point to value to be expected when descriptor is set.
+ * @param mask
+ *   the pointer point to comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +776,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +841,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/**< Empty poll number */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	/**< Power mgmt Callback mode */
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/**< Flag indicating the port power state */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[2];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 16:36       ` Stephen Hemminger
  2020-09-04 18:33       ` Ananyev, Konstantin
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
                       ` (8 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will either wait
until a TSC timestamp expires, or when packets arrive.

This API is limited to 1 core 1 port 1 queue use case as there is
no coordination between queues/cores in ethdev. 1 port map to multiple
core will be supported in next version.

This design leverage RX Callback mechnaism which allow three
different power management methodology co exist.

1. umwait/umonitor:

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. Pause instruction

   Instead of move the core into deeper C state, this lightweight
   method use Pause instruction to releaf the processor from
   busy polling.

3. Frequency Scaling
   Reuse exist rte power library to scale up/down core frequency
   depend on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_power/meson.build           |   3 +-
 lib/librte_power/rte_power.h           |  38 +++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
 lib/librte_power/rte_power_version.map |   4 +
 4 files changed, 228 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..44b01afce2 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
 headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
index bbbde4dfb4..06d5a9984f 100644
--- a/lib/librte_power/rte_power.h
+++ b/lib/librte_power/rte_power.h
@@ -14,6 +14,7 @@
 #include <rte_byteorder.h>
 #include <rte_log.h>
 #include <rte_string_fns.h>
+#include <rte_ethdev.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
  */
 int rte_power_exit(unsigned int lcore_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function mode.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 /**
  * Get the available frequencies of a specific lcore.
  * Function pointer definition. Review each environments
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..a445153ede
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+
+#include "rte_power.h"
+
+
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			uint16_t ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = (*dev->dev_ops->next_rx_desc)
+				(dev->data->rx_queues[qidx],
+				 &target_addr, &expected, &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr,
+						  expected, mask,
+						  0, -1ULL);
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (unlikely(nb_rx == 0)) {
+
+		dev->empty_poll_stats[qidx].num++;
+
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+				rte_pause();
+
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			/*scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		dev->empty_poll_stats[qidx].num = 0;
+		/* scal up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+			uint16_t port_id,
+			enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+	/* allocate memory for empty poll stats */
+	dev->empty_poll_stats = rte_malloc_socket(NULL,
+						  sizeof(struct rte_eth_ep_stat)
+						  * RTE_MAX_QUEUES_PER_PORT,
+						  0, dev->data->numa_node);
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	switch (mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+
+	dev->cb_mode = mode;
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_power_pmd_mgmt_disable(unsigned int lcore_id,
+				uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/*add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
+		return -EINVAL;
+
+	/* rte_free ignores NULL so safe to call without checks */
+	rte_free(dev->empty_poll_stats);
+
+	switch (dev->cb_mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		if (rte_power_exit(lcore_id))
+			return -EINVAL;
+		break;
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	dev->cur_pwr_cb = NULL;
+	dev->cb_mode = 0;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 00ee5753e2..ade83cfd4f 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.08
+	rte_power_pmd_mgmt_disable;
+	rte_power_pmd_mgmt_enable;
+
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 5/6] net/i40e: " Liang Ma
                       ` (7 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e2..618fc15732 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..d1d015deae 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..826f451bee 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 5/6] net/i40e: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (2 preceding siblings ...)
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 10:19     ` [dpdk-dev] [PATCH v3 6/6] net/ice: " Liang Ma
                       ` (6 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 11c02b1888..94e9298d7c 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -517,6 +517,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c1..9d7eea8aed 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..bfda5b6ad3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *value);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v3 6/6] net/ice: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (3 preceding siblings ...)
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 5/6] net/i40e: " Liang Ma
@ 2020-09-04 10:19     ` Liang Ma
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
                       ` (5 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-09-04 10:19 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 8d435e8892..7d7e1dcbac 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 2e1f06d2c0..c043181ceb 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (4 preceding siblings ...)
  2020-09-04 10:19     ` [dpdk-dev] [PATCH v3 6/6] net/ice: " Liang Ma
@ 2020-09-04 16:23     ` Stephen Hemminger
  2020-09-14 20:48       ` Liang, Ma
  2020-09-04 16:37     ` Stephen Hemminger
                       ` (4 subsequent siblings)
  10 siblings, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:23 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);

Since this is generic code, and you are defining the function.
You should have it return -ENOTSUPPORTED or -EINVAL.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-04 16:36       ` Stephen Hemminger
  2020-09-14 20:52         ` Liang, Ma
  2020-09-04 18:33       ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:36 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:57 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 port 1 queue use case as there is
> no coordination between queues/cores in ethdev. 1 port map to multiple
> core will be supported in next version.

The common way to express is this is:

This API is not thread-safe and not preempt-safe.
There is also no mechanism for a single thread to wait on multiple queues.

> 
> This design leverage RX Callback mechnaism which allow three
> different power management methodology co exist.

nit coexist is one word

> 
> 1. umwait/umonitor:
> 
>    The TSC timestamp is automatically calculated using current
>    link speed and RX descriptor ring size, such that the sleep
>    time is not longer than it would take for a NIC to fill its
>    entire RX descriptor ring.
> 
> 2. Pause instruction
> 
>    Instead of move the core into deeper C state, this lightweight
>    method use Pause instruction to releaf the processor from
>    busy polling.

Wording here is a problem, and "releaf" should be "relief"?
Rewording into active voice grammar would be easier.

     Use Pause instruction to allow processor to go into deeper C
     state when busy polling.



> 
> 3. Frequency Scaling
>    Reuse exist rte power library to scale up/down core frequency
>    depend on traffic volume.


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-04 16:37       ` Stephen Hemminger
  2020-09-14 21:04         ` Liang, Ma
  2020-09-04 20:54       ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:37 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:56 +0100
Liang Ma <liang.j.ma@intel.com> wrote:



> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */

Spelling here.

Also, shouldn't this be a per-device (or per-queue) configuration value.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (5 preceding siblings ...)
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
@ 2020-09-04 16:37     ` Stephen Hemminger
  2020-09-14 20:49       ` Liang, Ma
  2020-09-04 18:42     ` Stephen Hemminger
                       ` (3 subsequent siblings)
  10 siblings, 1 reply; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:37 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

This looks like a useful feature but needs more documentation and example.
It would make sense to put an example in l3fwd-power. 


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
  2020-09-04 16:36       ` Stephen Hemminger
@ 2020-09-04 18:33       ` Ananyev, Konstantin
  2020-09-14 21:01         ` Liang, Ma
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-04 18:33 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J

> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 port 1 queue use case as there is
> no coordination between queues/cores in ethdev. 1 port map to multiple
> core will be supported in next version.
> 
> This design leverage RX Callback mechnaism which allow three
> different power management methodology co exist.
> 
> 1. umwait/umonitor:
> 
>    The TSC timestamp is automatically calculated using current
>    link speed and RX descriptor ring size, such that the sleep
>    time is not longer than it would take for a NIC to fill its
>    entire RX descriptor ring.
> 
> 2. Pause instruction
> 
>    Instead of move the core into deeper C state, this lightweight
>    method use Pause instruction to releaf the processor from
>    busy polling.
> 
> 3. Frequency Scaling
>    Reuse exist rte power library to scale up/down core frequency
>    depend on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_power/meson.build           |   3 +-
>  lib/librte_power/rte_power.h           |  38 +++++
>  lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_version.map |   4 +
>  4 files changed, 228 insertions(+), 1 deletion(-)
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
> 
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 78c031c943..44b01afce2 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>  		'power_kvm_vm.c', 'guest_channel.c',
>  		'rte_power_empty_poll.c',
>  		'power_pstate_cpufreq.c',
> +		'rte_power_pmd_mgmt.c',
>  		'power_common.c')
>  headers = files('rte_power.h','rte_power_empty_poll.h')
> -deps += ['timer']
> +deps += ['timer' ,'ethdev']
> diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
> index bbbde4dfb4..06d5a9984f 100644
> --- a/lib/librte_power/rte_power.h
> +++ b/lib/librte_power/rte_power.h
> @@ -14,6 +14,7 @@
>  #include <rte_byteorder.h>
>  #include <rte_log.h>
>  #include <rte_string_fns.h>
> +#include <rte_ethdev.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
> @@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
>   */
>  int rte_power_exit(unsigned int lcore_id);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable device power management.
> + * @param lcore_id
> + *   lcore id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param mode
> + *   The power management callback function mode.
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> +				  uint16_t port_id,
> +				  enum rte_eth_dev_power_mgmt_cb_mode mode);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Disable device power management.
> + * @param lcore_id
> + *   lcore id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
> +
>  /**
>   * Get the available frequencies of a specific lcore.
>   * Function pointer definition. Review each environments
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> new file mode 100644
> index 0000000000..a445153ede
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> @@ -0,0 +1,184 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +#include <rte_ethdev.h>
> +
> +#include "rte_power.h"
> +
> +
> +
> +static uint16_t
> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		dev->empty_poll_stats[qidx].num++;

I believe there are two fundamental issues with that approach:
1. You put metadata specific lib (power) callbacks into rte_eth_dev struct.
2. These callbacks do access rte_eth_devices[] directly. 
That doesn't look right to me - rte_eth_dev structure supposed to be treated
as internal one librt_ether and underlying drivers and should be accessed directly
by outer code.
If these callbacks need some extra metadata, then it is responsibility
of power library to allocate/manage these metadata.
You can pass pointer to this metadata via last parameter for rte_eth_add_rx_callback().

> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +			volatile void *target_addr;
> +			uint64_t expected, mask;
> +			uint16_t ret;
> +
> +			/*
> +			 * get address of next descriptor in the RX
> +			 * ring for this queue, as well as expected
> +			 * value and a mask.
> +			 */
> +			ret = (*dev->dev_ops->next_rx_desc)
> +				(dev->data->rx_queues[qidx],
> +				 &target_addr, &expected, &mask);
> +			if (ret == 0)
> +				/* -1ULL is maximum value for TSC */
> +				rte_power_monitor(target_addr,
> +						  expected, mask,
> +						  0, -1ULL);
> +		}
> +	} else
> +		dev->empty_poll_stats[qidx].num = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	int i;
> +
> +	if (unlikely(nb_rx == 0)) {
> +
> +		dev->empty_poll_stats[qidx].num++;
> +
> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +
> +			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> +				rte_pause();
> +
> +		}
> +	} else
> +		dev->empty_poll_stats[qidx].num = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		dev->empty_poll_stats[qidx].num++;
> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +
> +			/*scale down freq */
> +			rte_power_freq_min(rte_lcore_id());
> +
> +		}
> +	} else {
> +		dev->empty_poll_stats[qidx].num = 0;
> +		/* scal up freq */
> +		rte_power_freq_max(rte_lcore_id());
> +	}
> +
> +	return nb_rx;
> +}
> +
> +int
> +rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> +			uint16_t port_id,
> +			enum rte_eth_dev_power_mgmt_cb_mode mode)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
> +		return -EINVAL;
> +	/* allocate memory for empty poll stats */
> +	dev->empty_poll_stats = rte_malloc_socket(NULL,
> +						  sizeof(struct rte_eth_ep_stat)
> +						  * RTE_MAX_QUEUES_PER_PORT,
> +						  0, dev->data->numa_node);
> +	if (dev->empty_poll_stats == NULL)
> +		return -ENOMEM;
> +
> +	switch (mode) {
> +	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> +		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> +			return -ENOTSUP;

Here and in other places: in case of error return you don't' free your empty_poll_stats.

> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,

Why zero for queue number, why not to pass queue_id as a parameter for that function?

> +						rte_power_mgmt_umwait, NULL);

As I said above, instead of NULL - could be pointer to metadata struct.

> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> +		/* init scale freq */
> +		if (rte_power_init(lcore_id))
> +			return -EINVAL;
> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> +					rte_power_mgmt_scalefreq, NULL);
> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> +						rte_power_mgmt_pause, NULL);
> +		break;
> +	}
> +
> +	dev->cb_mode = mode;
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> +	return 0;
> +}
> +
> +int
> +rte_power_pmd_mgmt_disable(unsigned int lcore_id,
> +				uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/*add flag check */
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
> +		return -EINVAL;
> +
> +	/* rte_free ignores NULL so safe to call without checks */
> +	rte_free(dev->empty_poll_stats);

You can't free callback metadata before removing the callback itself.
In fact, with current rx callback code it is not safe to free it
even after (we discussed it offline).

> +
> +	switch (dev->cb_mode) {
> +	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> +	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, 0,
> +					   dev->cur_pwr_cb);
> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> +		rte_power_freq_max(lcore_id);

Stupid q: what makes you think that lcore frequency was max,
*before* you setup the callback?

> +		rte_eth_remove_rx_callback(port_id, 0,
> +					   dev->cur_pwr_cb);
> +		if (rte_power_exit(lcore_id))
> +			return -EINVAL;
> +		break;
> +	}
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> +	dev->cur_pwr_cb = NULL;
> +	dev->cb_mode = 0;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index 00ee5753e2..ade83cfd4f 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -34,4 +34,8 @@ EXPERIMENTAL {
>  	rte_power_guest_channel_receive_msg;
>  	rte_power_poll_stat_fetch;
>  	rte_power_poll_stat_update;
> +	# added in 20.08
> +	rte_power_pmd_mgmt_disable;
> +	rte_power_pmd_mgmt_enable;
> +
>  };
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (6 preceding siblings ...)
  2020-09-04 16:37     ` Stephen Hemminger
@ 2020-09-04 18:42     ` Stephen Hemminger
  2020-09-14 21:12       ` Liang, Ma
  2020-09-16 16:34       ` Liang, Ma
  2020-09-06 21:44     ` Ananyev, Konstantin
                       ` (2 subsequent siblings)
  10 siblings, 2 replies; 421+ messages in thread
From: Stephen Hemminger @ 2020-09-04 18:42 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

Before this is merged, please work with Arm maintainers to have a version that
works on Arm 64 as well. Don't think this should be merged unless the two major
platforms supported by DPDK can work with it.

Also, not sure if this mechanism can work with other drivers. You need to
work with other vendors to show that the same infrastructure can work with
their hardware. Once again, I don't think this can go in if it only can
work on Intel.  It needs to work on Broadcom, Mellanox to be useful.

Will it work in a VM? Will it work with virtio or vmxnet3?

Having a single vendor solution is a non-starter for me.
They don't all have to be there to get it merged, but if the design only
works on single platform then it is not helpful.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-04 16:37       ` Stephen Hemminger
@ 2020-09-04 20:54       ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-04 20:54 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J

> Add a simple API allow ethdev get the last
> available queue descriptor address from PMD.
> Also include internal structure update.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
>  lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
>  2 files changed, 66 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 70295d7ab7..d9312d3e11 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -157,6 +157,7 @@ extern "C" {
>  #include <rte_common.h>
>  #include <rte_config.h>
>  #include <rte_ether.h>
> +#include <rte_power_intrinsics.h>
> 
>  #include "rte_ethdev_trace_fp.h"
>  #include "rte_dev_info.h"
> @@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
>  /** Maximum nb. of vlan per mirror rule */
>  #define ETH_MIRROR_MAX_VLANS       64
> 
> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
>  #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
>  #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
>  #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
> @@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
>  	RTE_ETH_DEV_REMOVED,
>  };
> 
> +#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum rte_eth_dev_power_mgmt_state {
> +	/** Device power management is disabled. */
> +	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	RTE_ETH_DEV_POWER_MGMT_ENABLED,
> +};
> +
> +enum rte_eth_dev_power_mgmt_cb_mode {
> +	/** WAIT callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
> +	/** PAUSE callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
> +	/** Freq Scaling callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
> +};
> +

I don't think we need to put all these power related
staff into rte_ethdev library.
rte_power or so, seems like much better place for it.

>  struct rte_eth_dev_sriov {
>  	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
>  	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
> diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
> index 32407dd418..16e54bb4e4 100644
> --- a/lib/librte_ethdev/rte_ethdev_core.h
> +++ b/lib/librte_ethdev/rte_ethdev_core.h
> @@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the next RX ring descriptor address.
> + *
> + * @param rxq
> + *   ethdev queue pointer.
> + * @param tail_desc_addr
> + *   the pointer point to descriptor address var.
> + * @param expected
> + *   the pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   the pointer point to comparison bitmask for the expected value.
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */
> +typedef int (*eth_next_rx_desc_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +

In theory it could be anything: next RXD, doorbell,
even some global variable.
So I think function name needs to be more neutral:
eth_rx_wait_addr() or so.
Also I think you need a new rte_eth_ wrapper function
for that dev op.  

>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -752,6 +776,8 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_next_rx_desc_t next_rx_desc;
> +	/**< Get next RX ring descriptor address. */
>  };
> 
>  /**
> @@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
>  	void *param;
>  };
> 
> +/**
> + * @internal
> + * Structure used to hold counters for empty poll
> + */
> +struct rte_eth_ep_stat {
> +	uint64_t num;
> +} __rte_cache_aligned;
> +
>  /**
>   * @internal
>   * The generic data structure associated with each ethernet device.
> @@ -807,8 +841,16 @@ struct rte_eth_dev {
>  	enum rte_eth_dev_state state; /**< Flag indicating the port state */
>  	void *security_ctx; /**< Context for security ops */
> 
> -	uint64_t reserved_64s[4]; /**< Reserved for future fields */
> -	void *reserved_ptrs[4];   /**< Reserved for future fields */
> +	/**< Empty poll number */
> +	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
> +	/**< Power mgmt Callback mode */
> +	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
> +	uint64_t reserved_64s[3]; /**< Reserved for future fields */
> +
> +	/**< Flag indicating the port power state */
> +	struct rte_eth_ep_stat *empty_poll_stats;
> +	const struct rte_eth_rxtx_callback *cur_pwr_cb;
> +	void *reserved_ptrs[2];   /**< Reserved for future fields */
>  } __rte_cache_aligned;
> 
>  struct rte_eth_dev_sriov;
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (7 preceding siblings ...)
  2020-09-04 18:42     ` Stephen Hemminger
@ 2020-09-06 21:44     ` Ananyev, Konstantin
  2020-09-18  5:01     ` Jerin Jacob
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
  10 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-06 21:44 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J


> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..6dd1cdc939
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_

As a nit - if the function below are supported for both 64 and 32 ISA,
then probably: RTE_POWER_INTRINSIC_X86_H_


> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
@ 2020-09-14 20:48       ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 20:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

Hi Stephen, 
   Agree. v4 will address this.
Regards
Liang
On 04 Sep 09:23, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:55 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> > + *
> > + * @return
> > + *   Architecture-dependent return value.
> > + */
> > +static inline int rte_power_monitor(const volatile void *p,
> > +		const uint64_t expected_value, const uint64_t value_mask,
> > +		const uint32_t state, const uint64_t tsc_timestamp);
> 
> Since this is generic code, and you are defining the function.
> You should have it return -ENOTSUPPORTED or -EINVAL.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 16:37     ` Stephen Hemminger
@ 2020-09-14 20:49       ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 20:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

Hi Stephen, 
   v4 patch will include the l3fwd-power udpate.
Regards
Liang

On 04 Sep 09:37, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:55 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> > Add two new power management intrinsics, and provide an implementation
> > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > are implemented as raw byte opcodes because there is not yet widespread
> > compiler support for these instructions.
> > 
> > The power management instructions provide an architecture-specific
> > function to either wait until a specified TSC timestamp is reached, or
> > optionally wait until either a TSC timestamp is reached or a memory
> > location is written to. The monitor function also provides an optional
> > comparison, to avoid sleeping when the expected write has already
> > happened, and no more writes are expected.
> > 
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> This looks like a useful feature but needs more documentation and example.
> It would make sense to put an example in l3fwd-power. 
>

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 16:36       ` Stephen Hemminger
@ 2020-09-14 20:52         ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 20:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

<snip>
Hi Stephen, 
    v4 will support 1 port with multiple core(still 1 queue per core)support
    this part description will be updated according to the design change.
Regards
Liang
> The common way to express is this is:
> 
> This API is not thread-safe and not preempt-safe.
> There is also no mechanism for a single thread to wait on multiple queues.
> 
> > 
> > This design leverage RX Callback mechnaism which allow three
> > different power management methodology co exist.
> 
> nit coexist is one word
> 
> > 
> > 1. umwait/umonitor:
> > 
> >    The TSC timestamp is automatically calculated using current
> >    link speed and RX descriptor ring size, such that the sleep
> >    time is not longer than it would take for a NIC to fill its
> >    entire RX descriptor ring.
> > 
> > 2. Pause instruction
> > 
> >    Instead of move the core into deeper C state, this lightweight
> >    method use Pause instruction to releaf the processor from
> >    busy polling.
> 
> Wording here is a problem, and "releaf" should be "relief"?
> Rewording into active voice grammar would be easier.
> 
>      Use Pause instruction to allow processor to go into deeper C
>      state when busy polling.
> 
> 
> 
> > 
> > 3. Frequency Scaling
> >    Reuse exist rte power library to scale up/down core frequency
> >    depend on traffic volume.
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 18:33       ` Ananyev, Konstantin
@ 2020-09-14 21:01         ` Liang, Ma
  2020-09-16 14:53           ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 21:01 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Hunt, David, Burakov, Anatoly

On 04 Sep 11:33, Ananyev, Konstantin wrote:

<snip>
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +dev->empty_poll_stats[qidx].num++;
>
Hi Konstantin,
   Agree, v4 will relocate the meta data to seperate structure. 
   and without touch the rte_ethdev structure. 
> I believe there are two fundamental issues with that approach:
> 1. You put metadata specific lib (power) callbacks into rte_eth_dev struct.
> 2. These callbacks do access rte_eth_devices[] directly.
> That doesn't look right to me - rte_eth_dev structure supposed to be treated
> as internal one librt_ether and underlying drivers and should be accessed directly
> by outer code.
> If these callbacks need some extra metadata, then it is responsibility
> of power library to allocate/manage these metadata.
> You can pass pointer to this metadata via last parameter for rte_eth_add_rx_callback().
> 
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +volatile void *target_addr;
> > +uint64_t expected, mask;
> > +uint16_t ret;
> > +
> > +/*
> > + * get address of next descriptor in the RX
> > + * ring for this queue, as well as expected
> > + * value and a mask.
> > + */
> > +ret = (*dev->dev_ops->next_rx_desc)
> > +(dev->data->rx_queues[qidx],
> > + &target_addr, &expected, &mask);
> > +if (ret == 0)
> > +/* -1ULL is maximum value for TSC */
> > +rte_power_monitor(target_addr,
> > +  expected, mask,
> > +  0, -1ULL);
> > +}
> > +} else
> > +dev->empty_poll_stats[qidx].num = 0;
> > +
> > +return nb_rx;
> > +}
> > +
> > +static uint16_t
> > +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> > +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> > +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> > +{
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +int i;
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +
> > +dev->empty_poll_stats[qidx].num++;
> > +
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +
> > +for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> > +rte_pause();
> > +
> > +}
> > +} else
> > +dev->empty_poll_stats[qidx].num = 0;
> > +
> > +return nb_rx;
> > +}
> > +
> > +static uint16_t
> > +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> > +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> > +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> > +{
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +dev->empty_poll_stats[qidx].num++;
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +
> > +/*scale down freq */
> > +rte_power_freq_min(rte_lcore_id());
> > +
> > +}
> > +} else {
> > +dev->empty_poll_stats[qidx].num = 0;
> > +/* scal up freq */
> > +rte_power_freq_max(rte_lcore_id());
> > +}
> > +
> > +return nb_rx;
> > +}
> > +
> > +int
> > +rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> > +uint16_t port_id,
> > +enum rte_eth_dev_power_mgmt_cb_mode mode)
> > +{
> > +struct rte_eth_dev *dev;
> > +
> > +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> > +dev = &rte_eth_devices[port_id];
> > +
> > +if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
> > +return -EINVAL;
> > +/* allocate memory for empty poll stats */
> > +dev->empty_poll_stats = rte_malloc_socket(NULL,
> > +  sizeof(struct rte_eth_ep_stat)
> > +  * RTE_MAX_QUEUES_PER_PORT,
> > +  0, dev->data->numa_node);
> > +if (dev->empty_poll_stats == NULL)
> > +return -ENOMEM;
> > +
> > +switch (mode) {
> > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > +if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> > +return -ENOTSUP;
> 
> Here and in other places: in case of error return you don't' free your empty_poll_stats.
> 
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> 
> Why zero for queue number, why not to pass queue_id as a parameter for that function?
v4 will move to use queue_id instead of 0. v3 still assume only queue 0 is used.
> 
> > +rte_power_mgmt_umwait, NULL);
> 
> As I said above, instead of NULL - could be pointer to metadata struct.
v4 will address this. 
> 
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > +/* init scale freq */
> > +if (rte_power_init(lcore_id))
> > +return -EINVAL;
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> > +rte_power_mgmt_scalefreq, NULL);
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> > +rte_power_mgmt_pause, NULL);
> > +break;
> > +}
> > +
> > +dev->cb_mode = mode;
> > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> > +return 0;
> > +}
> > +
> > +int
> > +rte_power_pmd_mgmt_disable(unsigned int lcore_id,
> > +uint16_t port_id)
> > +{
> > +struct rte_eth_dev *dev;
> > +
> > +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> > +dev = &rte_eth_devices[port_id];
> > +
> > +/*add flag check */
> > +
> > +if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
> > +return -EINVAL;
> > +
> > +/* rte_free ignores NULL so safe to call without checks */
> > +rte_free(dev->empty_poll_stats);
> 
> You can't free callback metadata before removing the callback itself.
> In fact, with current rx callback code it is not safe to free it
> even after (we discussed it offline).
agree. 
> 
> > +
> > +switch (dev->cb_mode) {
> > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > +rte_eth_remove_rx_callback(port_id, 0,
> > +   dev->cur_pwr_cb);
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > +rte_power_freq_max(lcore_id);
> 
> Stupid q: what makes you think that lcore frequency was max,
> *before* you setup the callback?
that is because the rte_power_init() has figured out the system max.
the init code invocate rte_power_init() already. 
> 
> > +rte_eth_remove_rx_callback(port_id, 0,
> > +   dev->cur_pwr_cb);
> > +if (rte_power_exit(lcore_id))
> > +return -EINVAL;
> > +break;
> > +}
> > +
> > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > +dev->cur_pwr_cb = NULL;
> > +dev->cb_mode = 0;
> > +
> > +return 0;
> > +}
> > diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> > index 00ee5753e2..ade83cfd4f 100644
> > --- a/lib/librte_power/rte_power_version.map
> > +++ b/lib/librte_power/rte_power_version.map
> > @@ -34,4 +34,8 @@ EXPERIMENTAL {
> >  rte_power_guest_channel_receive_msg;
> >  rte_power_poll_stat_fetch;
> >  rte_power_poll_stat_update;
> > +# added in 20.08
> > +rte_power_pmd_mgmt_disable;
> > +rte_power_pmd_mgmt_enable;
> > +
> >  };
> > --
> > 2.17.1
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 16:37       ` Stephen Hemminger
@ 2020-09-14 21:04         ` Liang, Ma
  0 siblings, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 21:04 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

agree, will be addressed 
On 04 Sep 09:37, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:56 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> 
> 
> > +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
> 
> Spelling here.
> 
> Also, shouldn't this be a per-device (or per-queue) configuration value.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 18:42     ` Stephen Hemminger
@ 2020-09-14 21:12       ` Liang, Ma
  2020-09-16 16:34       ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-14 21:12 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

On 04 Sep 11:42, Stephen Hemminger wrote:
<snip>
we are very open to discuss design  with other vendor.
> Before this is merged, please work with Arm maintainers to have a version that
> works on Arm 64 as well. Don't think this should be merged unless the two major
> platforms supported by DPDK can work with it. 

> Also, not sure if this mechanism can work with other drivers. You need to
> work with other vendors to show that the same infrastructure can work with
> their hardware. Once again, I don't think this can go in if it only can
> work on Intel.  It needs to work on Broadcom, Mellanox to be useful.
this mechanism should work with any device use a HW ring descriptor mechanism. 
I think most Mellanox and Broadcom NIC can support it easily. 

> Will it work in a VM? Will it work with virtio or vmxnet3?
> 
General speaking, Guest OS is not very easy to use this.
However, virtio is under invetigation.
> Having a single vendor solution is a non-starter for me.
> They don't all have to be there to get it merged, but if the design only
> works on single platform then it is not helpful.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-14 21:01         ` Liang, Ma
@ 2020-09-16 14:53           ` Ananyev, Konstantin
  2020-09-16 16:39             ` Liang, Ma
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-16 14:53 UTC (permalink / raw)
  To: Ma, Liang J; +Cc: dev, Hunt, David, Burakov, Anatoly



> >
> > > +
> > > +switch (dev->cb_mode) {
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > > +rte_eth_remove_rx_callback(port_id, 0,
> > > +   dev->cur_pwr_cb);
> > > +break;
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > > +rte_power_freq_max(lcore_id);
> >
> > Stupid q: what makes you think that lcore frequency was max,
> > *before* you setup the callback?
> that is because the rte_power_init() has figured out the system max.
> the init code invocate rte_power_init() already.

So rte_power_init(lcore) always raises lcore frequency to
max possible value?

> >
> > > +rte_eth_remove_rx_callback(port_id, 0,
> > > +   dev->cur_pwr_cb);
> > > +if (rte_power_exit(lcore_id))
> > > +return -EINVAL;
> > > +break;
> > > +}
> > > +
> > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > +dev->cur_pwr_cb = NULL;
> > > +dev->cb_mode = 0;
> > > +
> > > +return 0;
> > > +}

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 18:42     ` Stephen Hemminger
  2020-09-14 21:12       ` Liang, Ma
@ 2020-09-16 16:34       ` Liang, Ma
  1 sibling, 0 replies; 421+ messages in thread
From: Liang, Ma @ 2020-09-16 16:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

On 04 Sep 11:42, Stephen Hemminger wrote:
<snip> 

we have discussed with arm developer in the past.
Please ref https://patches.dpdk.org/patch/70662/
There was no objection, in my opinoin.
Also the API we proposed has experimental tag, other vendor still can change it.

For the ethdev internal ops we introduced should work with any NIC use ring descriptor
writeback mechansim. But we lack the internal sight of Mellanox or Broadcom NIC. 

AF_XDP PMD and virtio-net is under investigation. 

I hope above explaination addressed your concern. 

> Before this is merged, please work with Arm maintainers to have a version that
> works on Arm 64 as well. Don't think this should be merged unless the two major
> platforms supported by DPDK can work with it.
> 
> Also, not sure if this mechanism can work with other drivers. You need to
> work with other vendors to show that the same infrastructure can work with
> their hardware. Once again, I don't think this can go in if it only can
> work on Intel.  It needs to work on Broadcom, Mellanox to be useful.
> 
> Will it work in a VM? Will it work with virtio or vmxnet3?
> 
> Having a single vendor solution is a non-starter for me.
> They don't all have to be there to get it merged, but if the design only
> works on single platform then it is not helpful.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-16 14:53           ` Ananyev, Konstantin
@ 2020-09-16 16:39             ` Liang, Ma
  2020-09-16 16:44               ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Liang, Ma @ 2020-09-16 16:39 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Hunt, David, Burakov, Anatoly

On 16 Sep 07:53, Ananyev, Konstantin wrote:
<snip>
Yes. we only has two gear. min or max. However, user still can customize
their system max with power mgmt python script on Intel platform. 
> So rte_power_init(lcore) always raises lcore frequency to
> max possible value?
> 
> > >
> > > > +rte_eth_remove_rx_callback(port_id, 0,
> > > > +   dev->cur_pwr_cb);
> > > > +if (rte_power_exit(lcore_id))
> > > > +return -EINVAL;
> > > > +break;
> > > > +}
> > > > +
> > > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > > +dev->cur_pwr_cb = NULL;
> > > > +dev->cb_mode = 0;
> > > > +
> > > > +return 0;
> > > > +}

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-16 16:39             ` Liang, Ma
@ 2020-09-16 16:44               ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-09-16 16:44 UTC (permalink / raw)
  To: Ma, Liang J; +Cc: dev, Hunt, David, Burakov, Anatoly

> On 16 Sep 07:53, Ananyev, Konstantin wrote:
> <snip>
> Yes. we only has two gear. min or max. However, user still can customize
> their system max with power mgmt python script on Intel platform.

Ok, thanks for explanation.

> > So rte_power_init(lcore) always raises lcore frequency to
> > max possible value?
> >
> > > >
> > > > > +rte_eth_remove_rx_callback(port_id, 0,
> > > > > +   dev->cur_pwr_cb);
> > > > > +if (rte_power_exit(lcore_id))
> > > > > +return -EINVAL;
> > > > > +break;
> > > > > +}
> > > > > +
> > > > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > > > +dev->cur_pwr_cb = NULL;
> > > > > +dev->cb_mode = 0;
> > > > > +
> > > > > +return 0;
> > > > > +}

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (8 preceding siblings ...)
  2020-09-06 21:44     ` Ananyev, Konstantin
@ 2020-09-18  5:01     ` Jerin Jacob
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
  10 siblings, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-09-18  5:01 UTC (permalink / raw)
  To: Liang Ma, Honnappa Nagarahalli, Stephen Hemminger
  Cc: dpdk-dev, David Hunt, Anatoly Burakov, Richardson, Bruce,
	Ananyev, Konstantin, Thomas Monjalon

On Fri, Sep 4, 2020 at 3:49 PM Liang Ma <liang.j.ma@intel.com> wrote:
>
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
>
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>


> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
[snip]
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint32_t state, const uint64_t tsc_timestamp)

IMO, We must introduce some arch feature-capability _get_ scheme to tell
the consumer of this API is only supported on x86. Probably as functions[1]
or macro flags scheme and have a stub for the other architectures as the
API marked as generic ie rte_power_* not rte_x86_..

This will help the consumer to create workers based on the instruction features
which can NOT be abstracted as a generic feature across the architectures.


[1]
struct rte_arch_inst_feat {
        uint32_t power_monitor      : 1;  /**< Power monitor */
...
}

void rte_arch_inst_feat_get(struct rte_arch_inst_feat *feat);

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (9 preceding siblings ...)
  2020-09-18  5:01     ` Jerin Jacob
@ 2020-10-02 14:11     ` Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
                         ` (20 more replies)
  10 siblings, 21 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Add new x86 cpuid support for WAITPKG.
This flag indicate processor support umwait/umonitor/tpause
instruction.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/**< UMWAIT/TPAUSE Instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-08  8:33         ` Thomas Monjalon
  2020-10-08 17:15         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
                         ` (19 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

For more details, Please reference Intel SDM Volume 2.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 4 files changed, 209 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file define APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* the rflags need match native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* the rflags need match native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-08  8:46         ` Thomas Monjalon
  2020-10-08 22:26         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback Liang Ma
                         ` (18 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Add a simple API allow ethdev get wake up address from PMD.
Also include internal structure update.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.c           | 19 ++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_version.map |  1 +
 4 files changed, 72 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index d7668114ca..88253d95f9 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -4804,6 +4804,25 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		      volatile void **wake_addr,
+		      uint64_t *expected, uint64_t *mask)
+{
+	struct rte_eth_dev *dev;
+	uint16_t ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	ret = (*dev->dev_ops->get_wake_addr)
+				(dev->data->rx_queues[queue_id],
+				 wake_addr, expected, mask);
+
+	return ret;
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d2bf74f128..a6cfe3cd57 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4014,6 +4014,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake up address from specific queue
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Tx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param wake_addr
+ *   The pointer point to the address which is used for monitoring.
+ * @param expected
+ *   The pointer point to value to be expected when descriptor is set.
+ * @param mask
+ *   The pointer point to comparison bitmask for the expected value.
+ *
+ * @return
+ *   - 0: Success.
+ *   -EINVAL: Failed to get wake address.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+			  volatile void **wake_addr,
+			  uint64_t *expected, uint64_t *mask);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c3062c246c..935d46f25c 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the Wake up address.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   The pointer point to descriptor address var.
+ * @param expected
+ *   The pointer point to value to be expected when descriptor is set.
+ * @param mask
+ *   The pointer point to comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_get_wake_addr_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -713,6 +738,9 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get wake up address. */
+
 };
 
 /**
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index c95ef5157a..3cb2093980 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -229,6 +229,7 @@ EXPERIMENTAL {
 	# added in 20.11
 	rte_eth_link_speed_to_str;
 	rte_eth_link_to_str;
+	rte_eth_get_wake_addr;
 };
 
 INTERNAL {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-09 16:38         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API Liang Ma
                         ` (17 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will either wait
until a TSC timestamp expires, or when packets arrive.

This API support 1 port to multiple core use case.

This design leverage RX Callback mechnaism which allow three
different power management methodology co exist.

1. umwait/umonitor:

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. Pause instruction

   Instead of move the core into deeper C state, this lightweight
   method use Pause instruction to relief the processor from
   busy polling.

3. Frequency Scaling
   Reuse exist rte power library to scale up/down core frequency
   depend on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/pmd_mgmt.h            |  49 ++++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 208 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++++
 lib/librte_power/rte_power_version.map |   4 +
 5 files changed, 352 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/pmd_mgmt.h
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
new file mode 100644
index 0000000000..756fbe20f7
--- /dev/null
+++ b/lib/librte_power/pmd_mgmt.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _PMD_MGMT_H
+#define _PMD_MGMT_H
+
+/**
+ * @file
+ * Power Management
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< Power mgmt Callback mode */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Empty poll number */
+	uint16_t empty_poll_stats;
+	/**< Callback instance  */
+	const struct rte_eth_rxtx_callback *cur_cb;
+} __rte_cache_aligned;
+
+struct pmd_port_cfg {
+	int  ref_cnt;
+	struct pmd_queue_cfg *queue_cfg;
+} __rte_cache_aligned;
+
+
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..35d2af46a4
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,208 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+#include "pmd_mgmt.h"
+
+
+#define EMPTYPOLL_MAX  512
+#define PAUSE_NUM  64
+
+static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			uint16_t ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = rte_eth_get_wake_addr(port_id, qidx,
+						    &target_addr, &expected,
+						    &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr,
+						  expected, mask,
+						  0, -1ULL);
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+	int i;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			for (i = 0; i < PAUSE_NUM; i++)
+				rte_pause();
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/*scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scal up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret = 0;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (port_cfg[port_id].queue_cfg == NULL) {
+		port_cfg[port_id].ref_cnt = 0;
+		/* allocate memory for empty poll stats */
+		port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
+					sizeof(struct pmd_queue_cfg)
+					* RTE_MAX_QUEUES_PER_PORT,
+					0, dev->data->numa_node);
+		if (port_cfg[port_id].queue_cfg == NULL)
+			return -ENOMEM;
+	}
+
+	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto failure_handler;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		/* init scale freq */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto failure_handler;
+		}
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+	queue_cfg->cb_mode = mode;
+	port_cfg[port_id].ref_cnt++;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	return ret;
+
+failure_handler:
+	if (port_cfg[port_id].ref_cnt == 0) {
+		rte_free(port_cfg[port_id].queue_cfg);
+		port_cfg[port_id].queue_cfg = NULL;
+	}
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	if (port_cfg[port_id].ref_cnt <= 0)
+		return -EINVAL;
+
+	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
+		return -EINVAL;
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+					   queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+					   queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/* it's not recommend to free callback instance here.
+	 * it cause memory leak which is a known issue.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	port_cfg[port_id].ref_cnt--;
+
+	if (port_cfg[port_id].ref_cnt == 0) {
+		rte_free(port_cfg[port_id].queue_cfg);
+		port_cfg[port_id].queue_cfg = NULL;
+	}
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..8b110f1148
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (2 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-09 15:53         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 06/10] net/i40e: " Liang Ma
                         ` (16 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 0b98e210e7..30b3f416d4 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..7a9fd2aec6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..75020fa2fc 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 06/10] net/i40e: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (3 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-09 16:01         ` Ananyev, Konstantin
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 07/10] net/ice: " Liang Ma
                         ` (15 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 943cfe71dc..cab86f8ec9 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr	              = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 322fc1ed75..c17f27292f 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..f23a2073e3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *value);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 07/10] net/ice: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (4 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 06/10] net/i40e: " Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 08/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
                         ` (14 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma, Anatoly Burakov

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d8ce09d28f..260de5dfd7 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr	              = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 93a0ac6918..9e55eca942 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -25,6 +25,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..c729e474c9 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		      uint64_t *expected, uint64_t *mask);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 08/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (5 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 07/10] net/ice: " Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 09/10] doc: update release notes for PMD power management Liang Ma
                         ` (13 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma

Add pmd power mgmt feature support.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 examples/l3fwd-power/main.c | 44 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index d0e6c9bd77..b1b139129a 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,8 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
+
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +201,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1753,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1775,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1886,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt  mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2452,9 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
+
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2723,12 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				rte_power_pmd_mgmt_queue_enable(lcore_id,
+							portid, queueid,
+						RTE_POWER_MGMT_TYPE_SCALE);
+
+			}
 		}
 	}
 
@@ -2790,8 +2814,12 @@ main(int argc, char **argv)
 						SKIP_MASTER);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MASTER);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MASTER);
 	}
 
+
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
 		launch_timer(rte_lcore_id());
 
@@ -2812,6 +2840,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 09/10] doc: update release notes for PMD power management
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (6 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 08/10] examples/l3fwd-power: enable PMD power mgmt Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 10/10] doc: update the programming guide " Liang Ma
                         ` (12 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma

Add release notes for PMD power management

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/rel_notes/release_20_11.rst | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index c2175f37f3..57ac73722a 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()``
+  * add new eth_dev_ops ``get_wake_addr``
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
@@ -107,6 +112,17 @@ New Features
   * Extern objects and functions can be plugged into the pipeline.
   * Transaction-oriented table updates.
 
+* **Add PMD power management mechanism**
+
+  3 new PMD power managmeent mechanism is added through existing
+  RX_ETH_CALLBACK infrastructure.
+
+  * Add umwait power saving scheme
+  * Add pause power saving scheme
+  * Add frequency scaling power saving scheme
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v4 10/10] doc: update the programming guide for PMD power management
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (7 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 09/10] doc: update release notes for PMD power management Liang Ma
@ 2020-10-02 14:11       ` Liang Ma
  2020-10-02 14:44       ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Bruce Richardson
                         ` (11 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-10-02 14:11 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, stephen, konstantin.ananyev, Liang Ma

Update programming guide and sample application l3fwd-power document
for PMD power management

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/prog_guide/power_man.rst           | 40 +++++++++++++++++++
 .../sample_app_ug/l3_forward_power_man.rst    | 15 ++++++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..c95b948874 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -188,6 +188,43 @@ API Overview for Empty Poll Power Management
 
 * **Detect empty poll state change**: empty poll state change detection algorithm then take action.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Given existing power management mechanism, developer need change application design or code to use it.
+In order to solve the problem, it's very helpful to make the design transparent to application.
+The proposed solution is to leverage RX_CALLBACK mechanism which allow three different power management
+methodology co exist. The trigger condition is empty poll number beyond defined threshold.
+
+  * umwait/umonitor
+
+   The new umwait/umonitor instruction monitoring the wake address then transfer processor to sub-state.
+   Once the content of address is changed, the processor will be wake up from the sub-state. Timeout is
+   setup as well, in case, there is no wake event happen, processor still will wake up after timeout
+   timer expired.
+
+  * Pause instruction
+
+   Instead of move the core into deeper C state, this lightweight method use Pause instruction
+   to relief the processor from busy polling.
+
+  * Frequency Scaling
+
+   Reuse exist rte power library to scale up/down core frequency
+   depend on traffic volume.
+
+The proposed solution support multiple port and each port can map to multiple core. But 1 core only can map
+1 queue(regardless which port). In theory, each queue belongs to same port can apply different power scheme.
+It's strongly recommend to use same power scheme for all queues belong to same port.
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core
+
 User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
@@ -200,3 +237,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 0cc6f2e62e..82f9ac849c 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -107,7 +107,9 @@ where,
 
 *   --empty-poll: Traffic Aware power management. See below for details
 
-*   --telemetry:  Telemetry mode.
+*   --telemetry: Telemetry mode.
+
+*   --pmd-mgmt: PMD power management mode.
 
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
@@ -459,3 +461,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management  mode support for ``l3fwd-power`` is a standalone mode, in this mode
+``l3fwd-power`` does simple l3fwding along with enable the power saving scheme on specific
+port/queue/lcore. Main purpose for this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power --pmd-mgmt -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)"
-- 
2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (8 preceding siblings ...)
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 10/10] doc: update the programming guide " Liang Ma
@ 2020-10-02 14:44       ` Bruce Richardson
  2020-10-08 22:08       ` Ananyev, Konstantin
                         ` (10 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Bruce Richardson @ 2020-10-02 14:44 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, stephen, konstantin.ananyev, Anatoly Burakov

On Fri, Oct 02, 2020 at 03:11:50PM +0100, Liang Ma wrote:
> Add new x86 cpuid support for WAITPKG.
> This flag indicate processor support umwait/umonitor/tpause
> instruction.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
>  lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..5041a830a7 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
>  	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
>  
> +	/**< UMWAIT/TPAUSE Instructions */
> +	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */
Typo: UMINITOR

>  	/* The last item */
>  	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
>  };
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 30439e7951..0325c4b93b 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
>  	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>  	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
>  
> +	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
> +
>  	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>  	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
@ 2020-10-08  8:33         ` Thomas Monjalon
  2020-10-08  8:44           ` Jerin Jacob
  2020-10-08 17:15         ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-08  8:33 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, david.hunt, stephen, konstantin.ananyev, Anatoly Burakov,
	Liang Ma, honnappa.nagarahalli, ruifeng.wang, David Christensen,
	jerinj

> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> For more details, Please reference Intel SDM Volume 2.

I really would like to see feedbacks from other arch maintainers.
Unfortunately they were not Cc'ed.

Also please mark the new functions as experimental.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08  8:33         ` Thomas Monjalon
@ 2020-10-08  8:44           ` Jerin Jacob
  2020-10-08  9:41             ` Thomas Monjalon
  2020-10-08 13:26             ` Burakov, Anatoly
  0 siblings, 2 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-08  8:44 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Liang Ma, dpdk-dev, David Hunt, Stephen Hemminger, Ananyev,
	Konstantin, Anatoly Burakov, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> > Add two new power management intrinsics, and provide an implementation
> > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > are implemented as raw byte opcodes because there is not yet widespread
> > compiler support for these instructions.
> >
> > The power management instructions provide an architecture-specific
> > function to either wait until a specified TSC timestamp is reached, or
> > optionally wait until either a TSC timestamp is reached or a memory
> > location is written to. The monitor function also provides an optional
> > comparison, to avoid sleeping when the expected write has already
> > happened, and no more writes are expected.
> >
> > For more details, Please reference Intel SDM Volume 2.
>
> I really would like to see feedbacks from other arch maintainers.
> Unfortunately they were not Cc'ed.

Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
http://mails.dpdk.org/archives/dev/2020-September/181646.html

> Also please mark the new functions as experimental.
>
>

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
@ 2020-10-08  8:46         ` Thomas Monjalon
  2020-10-08 11:39           ` Ananyev, Konstantin
  2020-10-08 22:26         ` Ananyev, Konstantin
  1 sibling, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-08  8:46 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, david.hunt, stephen, konstantin.ananyev, Anatoly Burakov,
	Liang Ma, ferruh.yigit, arybchenko, honnappa.nagarahalli,
	ruifeng.wang, jerinj, David Christensen

> +/**
> + * Retrieve the wake up address from specific queue
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Tx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param wake_addr
> + *   The pointer point to the address which is used for monitoring.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + *
> + * @return
> + *   - 0: Success.
> + *   -EINVAL: Failed to get wake address.
> + */
> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +                         volatile void **wake_addr,
> +                         uint64_t *expected, uint64_t *mask);

It looks to be a very low-level API.
Can't we do something more "ready-to-use" at ethdev level?

Cc'in the relevant maintainers...
Note: sorry this comment come late but ethdev maintainers were not Cc.
Reminder: having no feedback is not a good sign, you should request comments.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08  8:44           ` Jerin Jacob
@ 2020-10-08  9:41             ` Thomas Monjalon
  2020-10-08 13:26             ` Burakov, Anatoly
  1 sibling, 0 replies; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-08  9:41 UTC (permalink / raw)
  To: Liang Ma
  Cc: dev, David Hunt, Stephen Hemminger, Ananyev, Konstantin,
	Anatoly Burakov, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

08/10/2020 10:44, Jerin Jacob:
> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >
> > > Add two new power management intrinsics, and provide an implementation
> > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > are implemented as raw byte opcodes because there is not yet widespread
> > > compiler support for these instructions.
> > >
> > > The power management instructions provide an architecture-specific
> > > function to either wait until a specified TSC timestamp is reached, or
> > > optionally wait until either a TSC timestamp is reached or a memory
> > > location is written to. The monitor function also provides an optional
> > > comparison, to avoid sleeping when the expected write has already
> > > happened, and no more writes are expected.
> > >
> > > For more details, Please reference Intel SDM Volume 2.
> >
> > I really would like to see feedbacks from other arch maintainers.
> > Unfortunately they were not Cc'ed.
> 
> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> http://mails.dpdk.org/archives/dev/2020-September/181646.html

This comment was sent on September 18.
Later this v4 was sent without replying to the comments.
This is blocking the series.
I am considering this feature as low priority.

> > Also please mark the new functions as experimental.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-08  8:46         ` Thomas Monjalon
@ 2020-10-08 11:39           ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 11:39 UTC (permalink / raw)
  To: Thomas Monjalon, Ma, Liang J
  Cc: dev, Hunt, David, stephen, Burakov, Anatoly, Ma, Liang J, Yigit,
	Ferruh, arybchenko, honnappa.nagarahalli, ruifeng.wang, jerinj,
	David Christensen



> > +/**
> > + * Retrieve the wake up address from specific queue
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The Tx queue on the Ethernet device for which information
> > + *   will be retrieved.
> > + * @param wake_addr
> > + *   The pointer point to the address which is used for monitoring.
> > + * @param expected
> > + *   The pointer point to value to be expected when descriptor is set.
> > + * @param mask
> > + *   The pointer point to comparison bitmask for the expected value.
> > + *
> > + * @return
> > + *   - 0: Success.
> > + *   -EINVAL: Failed to get wake address.
> > + */
> > +__rte_experimental
> > +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> > +                         volatile void **wake_addr,
> > +                         uint64_t *expected, uint64_t *mask);
> 
> It looks to be a very low-level API.
> Can't we do something more "ready-to-use" at ethdev level?

I think that series provides both:
There is a low-level API at ethdev/eal to retrieve information to wait for
and actual function to put core to sleep.
Plus there is a high-level API at rte_power lib:
rte_power_pmd_mgmt_queue_enable()/rte_power_pmd_mgmt_queue_disable()
that uses these low-level ones and puts some high-level logic around it.
From my perspective it is a good design choice,
as it keeps all power-related burden inside rte_power library and
provides user a lot of flexibility in terms of API usage.
Konstantin 

> 
> Cc'in the relevant maintainers...
> Note: sorry this comment come late but ethdev maintainers were not Cc.
> Reminder: having no feedback is not a good sign, you should request comments.
> 


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08  8:44           ` Jerin Jacob
  2020-10-08  9:41             ` Thomas Monjalon
@ 2020-10-08 13:26             ` Burakov, Anatoly
  2020-10-08 15:13               ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-08 13:26 UTC (permalink / raw)
  To: Jerin Jacob, Thomas Monjalon
  Cc: Liang Ma, dpdk-dev, David Hunt, Stephen Hemminger, Ananyev,
	Konstantin, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>
>>> Add two new power management intrinsics, and provide an implementation
>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>> are implemented as raw byte opcodes because there is not yet widespread
>>> compiler support for these instructions.
>>>
>>> The power management instructions provide an architecture-specific
>>> function to either wait until a specified TSC timestamp is reached, or
>>> optionally wait until either a TSC timestamp is reached or a memory
>>> location is written to. The monitor function also provides an optional
>>> comparison, to avoid sleeping when the expected write has already
>>> happened, and no more writes are expected.
>>>
>>> For more details, Please reference Intel SDM Volume 2.
>>
>> I really would like to see feedbacks from other arch maintainers.
>> Unfortunately they were not Cc'ed.
> 
> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> 
>> Also please mark the new functions as experimental.
>>
>>

Hi Jerin,

 > IMO, We must introduce some arch feature-capability _get_ scheme to tell
 > the consumer of this API is only supported on x86. Probably as 
functions[1]
 > or macro flags scheme and have a stub for the other architectures as the
 > API marked as generic ie rte_power_* not rte_x86_..
 >
 > This will help the consumer to create workers based on the 
instruction features
 > which can NOT be abstracted as a generic feature across the 
architectures.

I'm not entirely sure what you mean by that.

I mean, yes, we should have added stubs for other architectures, and we 
will add those in future revisions, but what does your proposed runtime 
check accomplish that cannot currently be done with CPUID flags?

If you look at patch 1 [1], we added CPUID flags that the user can 
check, and in fact this is precisely what we do in patch 4 [2] before 
enabling the UMWAIT path. We could perhaps document this better and 
outline the dependency on the WAITPKG CPUID flag more explicitly, but 
otherwise i don't see how what you're proposing isn't already possible 
to do.

[1] http://patches.dpdk.org/patch/79539/
[2] http://patches.dpdk.org/patch/79540/ , function 
rte_power_pmd_mgmt_queue_enable()

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08 13:26             ` Burakov, Anatoly
@ 2020-10-08 15:13               ` Jerin Jacob
  2020-10-08 17:07                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-08 15:13 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, Liang Ma, dpdk-dev, David Hunt,
	Stephen Hemminger, Ananyev, Konstantin, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>
> >>> Add two new power management intrinsics, and provide an implementation
> >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>> are implemented as raw byte opcodes because there is not yet widespread
> >>> compiler support for these instructions.
> >>>
> >>> The power management instructions provide an architecture-specific
> >>> function to either wait until a specified TSC timestamp is reached, or
> >>> optionally wait until either a TSC timestamp is reached or a memory
> >>> location is written to. The monitor function also provides an optional
> >>> comparison, to avoid sleeping when the expected write has already
> >>> happened, and no more writes are expected.
> >>>
> >>> For more details, Please reference Intel SDM Volume 2.
> >>
> >> I really would like to see feedbacks from other arch maintainers.
> >> Unfortunately they were not Cc'ed.
> >
> > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >
> >> Also please mark the new functions as experimental.
> >>
> >>
>
> Hi Jerin,

Hi Anatoly,

>
>  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>  > the consumer of this API is only supported on x86. Probably as
> functions[1]
>  > or macro flags scheme and have a stub for the other architectures as the
>  > API marked as generic ie rte_power_* not rte_x86_..
>  >
>  > This will help the consumer to create workers based on the
> instruction features
>  > which can NOT be abstracted as a generic feature across the
> architectures.
>
> I'm not entirely sure what you mean by that.
>
> I mean, yes, we should have added stubs for other architectures, and we
> will add those in future revisions, but what does your proposed runtime
> check accomplish that cannot currently be done with CPUID flags?


RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.



>
> If you look at patch 1 [1], we added CPUID flags that the user can
> check, and in fact this is precisely what we do in patch 4 [2] before
> enabling the UMWAIT path. We could perhaps document this better and
> outline the dependency on the WAITPKG CPUID flag more explicitly, but
> otherwise i don't see how what you're proposing isn't already possible
> to do.
>
> [1] http://patches.dpdk.org/patch/79539/
> [2] http://patches.dpdk.org/patch/79540/ , function
> rte_power_pmd_mgmt_queue_enable()
>
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08 15:13               ` Jerin Jacob
@ 2020-10-08 17:07                 ` Ananyev, Konstantin
  2020-10-09  5:42                   ` Jerin Jacob
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 17:07 UTC (permalink / raw)
  To: Jerin Jacob, Burakov, Anatoly
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

> 
> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
> >
> > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > >>
> > >>> Add two new power management intrinsics, and provide an implementation
> > >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > >>> are implemented as raw byte opcodes because there is not yet widespread
> > >>> compiler support for these instructions.
> > >>>
> > >>> The power management instructions provide an architecture-specific
> > >>> function to either wait until a specified TSC timestamp is reached, or
> > >>> optionally wait until either a TSC timestamp is reached or a memory
> > >>> location is written to. The monitor function also provides an optional
> > >>> comparison, to avoid sleeping when the expected write has already
> > >>> happened, and no more writes are expected.
> > >>>
> > >>> For more details, Please reference Intel SDM Volume 2.
> > >>
> > >> I really would like to see feedbacks from other arch maintainers.
> > >> Unfortunately they were not Cc'ed.
> > >
> > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > >
> > >> Also please mark the new functions as experimental.
> > >>
> > >>
> >
> > Hi Jerin,
> 
> Hi Anatoly,
> 
> >
> >  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >  > the consumer of this API is only supported on x86. Probably as
> > functions[1]
> >  > or macro flags scheme and have a stub for the other architectures as the
> >  > API marked as generic ie rte_power_* not rte_x86_..
> >  >
> >  > This will help the consumer to create workers based on the
> > instruction features
> >  > which can NOT be abstracted as a generic feature across the
> > architectures.
> >
> > I'm not entirely sure what you mean by that.
> >
> > I mean, yes, we should have added stubs for other architectures, and we
> > will add those in future revisions, but what does your proposed runtime
> > check accomplish that cannot currently be done with CPUID flags?
> 
> 
> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.


I am agree with Jerin, that we need some generic way to
figure-out does platform supports power_monitor() or not.
Though not sure do we need to create a new feature-get framework here...
Might be just something like:
 rte_power_monitor(...) == -ENOTSUP
be enough indication for that?
So user can just do:
if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
	/* not supported  path */
}

To check is that feature supported or not.

> >
> > If you look at patch 1 [1], we added CPUID flags that the user can
> > check, and in fact this is precisely what we do in patch 4 [2] before
> > enabling the UMWAIT path. We could perhaps document this better and
> > outline the dependency on the WAITPKG CPUID flag more explicitly, but
> > otherwise i don't see how what you're proposing isn't already possible
> > to do.
> >
> > [1] http://patches.dpdk.org/patch/79539/
> > [2] http://patches.dpdk.org/patch/79540/ , function
> > rte_power_pmd_mgmt_queue_enable()
> >
> > --
> > Thanks,
> > Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics Liang Ma
  2020-10-08  8:33         ` Thomas Monjalon
@ 2020-10-08 17:15         ` Ananyev, Konstantin
  2020-10-09  9:11           ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 17:15 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.

I think what this API is missing - a function to wakeup sleeping core.
If user can/should use some system call to achieve that, then at least
it has to be clearly documented, even better some wrapper provided.

> 
> For more details, Please reference Intel SDM Volume 2.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
>  4 files changed, 209 insertions(+)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> 
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..cd7f8070ac
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file define APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index cd09027958..3a12e87e19 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -60,6 +60,7 @@ generic_headers = files(
>  	'generic/rte_memcpy.h',
>  	'generic/rte_pause.h',
>  	'generic/rte_prefetch.h',
> +	'generic/rte_power_intrinsics.h',
>  	'generic/rte_rwlock.h',
>  	'generic/rte_spinlock.h',
>  	'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_prefetch.h',
> +	'rte_power_intrinsics.h',
>  	'rte_pause.h',
>  	'rte_rtm.h',
>  	'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..6dd1cdc939
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (9 preceding siblings ...)
  2020-10-02 14:44       ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Bruce Richardson
@ 2020-10-08 22:08       ` Ananyev, Konstantin
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
                         ` (9 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 22:08 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly


> Add new x86 cpuid support for WAITPKG.
> This flag indicate processor support umwait/umonitor/tpause
> instruction.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
>  lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..5041a830a7 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
>  	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
> 
> +	/**< UMWAIT/TPAUSE Instructions */
> +	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */
>  	/* The last item */
>  	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
>  };
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 30439e7951..0325c4b93b 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
>  	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>  	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
> 
> +	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
> +
>  	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>  	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
> 
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API Liang Ma
  2020-10-08  8:46         ` Thomas Monjalon
@ 2020-10-08 22:26         ` Ananyev, Konstantin
  2020-10-09 16:11           ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-08 22:26 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> 
> Add a simple API allow ethdev get wake up address from PMD.
> Also include internal structure update.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_ethdev/rte_ethdev.c           | 19 ++++++++++++++++
>  lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_version.map |  1 +
>  4 files changed, 72 insertions(+)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index d7668114ca..88253d95f9 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -4804,6 +4804,25 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
>  }
> 
> +int
> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		      volatile void **wake_addr,
> +		      uint64_t *expected, uint64_t *mask)
> +{
> +	struct rte_eth_dev *dev;
> +	uint16_t ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +
> +	dev = &rte_eth_devices[port_id];
> +
> +	ret = (*dev->dev_ops->get_wake_addr)
> +				(dev->data->rx_queues[queue_id],
> +				 wake_addr, expected, mask);


This is an optional dev_ops, so I think you need to check that get_wake_addr()
is defined for that PMD.
Plus you need to check that queue_id is valid.

> +
> +	return ret;
> +}
> +
>  int
>  rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>  			     struct rte_ether_addr *mc_addr_set,
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index d2bf74f128..a6cfe3cd57 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -4014,6 +4014,30 @@ __rte_experimental
>  int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  	struct rte_eth_burst_mode *mode);
> 
> +/**
> + * Retrieve the wake up address from specific queue
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Tx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param wake_addr
> + *   The pointer point to the address which is used for monitoring.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + *
> + * @return
> + *   - 0: Success.
> + *   -EINVAL: Failed to get wake address.
> + */
> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +			  volatile void **wake_addr,
> +			  uint64_t *expected, uint64_t *mask);
> +
>  /**
>   * Retrieve device registers and register attributes (number of registers and
>   * register size)
> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
> index c3062c246c..935d46f25c 100644
> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> @@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the Wake up address.
> + *
> + * @param rxq
> + *   Ethdev queue pointer.
> + * @param tail_desc_addr
> + *   The pointer point to descriptor address var.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */
> +typedef int (*eth_get_wake_addr_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -713,6 +738,9 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_get_wake_addr_t get_wake_addr;
> +	/**< Get wake up address. */
> +
>  };
> 
>  /**
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index c95ef5157a..3cb2093980 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -229,6 +229,7 @@ EXPERIMENTAL {
>  	# added in 20.11
>  	rte_eth_link_speed_to_str;
>  	rte_eth_link_to_str;
> +	rte_eth_get_wake_addr;
>  };
> 
>  INTERNAL {
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08 17:07                 ` Ananyev, Konstantin
@ 2020-10-09  5:42                   ` Jerin Jacob
  2020-10-09  9:25                     ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09  5:42 UTC (permalink / raw)
  To: Ananyev, Konstantin, David Marchand
  Cc: Burakov, Anatoly, Thomas Monjalon, Ma, Liang J, dpdk-dev, Hunt,
	David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
> >
> > On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> > >
> > > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > >>
> > > >>> Add two new power management intrinsics, and provide an implementation
> > > >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > >>> are implemented as raw byte opcodes because there is not yet widespread
> > > >>> compiler support for these instructions.
> > > >>>
> > > >>> The power management instructions provide an architecture-specific
> > > >>> function to either wait until a specified TSC timestamp is reached, or
> > > >>> optionally wait until either a TSC timestamp is reached or a memory
> > > >>> location is written to. The monitor function also provides an optional
> > > >>> comparison, to avoid sleeping when the expected write has already
> > > >>> happened, and no more writes are expected.
> > > >>>
> > > >>> For more details, Please reference Intel SDM Volume 2.
> > > >>
> > > >> I really would like to see feedbacks from other arch maintainers.
> > > >> Unfortunately they were not Cc'ed.
> > > >
> > > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > > >
> > > >> Also please mark the new functions as experimental.
> > > >>
> > > >>
> > >
> > > Hi Jerin,
> >
> > Hi Anatoly,
> >
> > >
> > >  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> > >  > the consumer of this API is only supported on x86. Probably as
> > > functions[1]
> > >  > or macro flags scheme and have a stub for the other architectures as the
> > >  > API marked as generic ie rte_power_* not rte_x86_..
> > >  >
> > >  > This will help the consumer to create workers based on the
> > > instruction features
> > >  > which can NOT be abstracted as a generic feature across the
> > > architectures.
> > >
> > > I'm not entirely sure what you mean by that.
> > >
> > > I mean, yes, we should have added stubs for other architectures, and we
> > > will add those in future revisions, but what does your proposed runtime
> > > check accomplish that cannot currently be done with CPUID flags?
> >
> >
> > RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> > i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> > and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> > I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>
>
> I am agree with Jerin, that we need some generic way to
> figure-out does platform supports power_monitor() or not.
> Though not sure do we need to create a new feature-get framework here...

That's works too. Some means of generic probing is fine. Following
schemed needs
more documentation on that usage, as, it is not straight forward compare to
feature-get framework. Also, on the other thread, we are adding the
new instructions like
demote cacheline etc, maybe if the user wants to KNOW if the arch
supports it then
the feature-get framework is good.
If we think, there is no other usecase for generic arch feature-get
framework then
we can keep the below scheme else generic arch feature is better for
more forward
looking use cases.

> Might be just something like:
>  rte_power_monitor(...) == -ENOTSUP
> be enough indication for that?
> So user can just do:
> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>         /* not supported  path */
> }
>
> To check is that feature supported or not.


>
> > >
> > > If you look at patch 1 [1], we added CPUID flags that the user can
> > > check, and in fact this is precisely what we do in patch 4 [2] before
> > > enabling the UMWAIT path. We could perhaps document this better and
> > > outline the dependency on the WAITPKG CPUID flag more explicitly, but
> > > otherwise i don't see how what you're proposing isn't already possible
> > > to do.
> > >
> > > [1] http://patches.dpdk.org/patch/79539/
> > > [2] http://patches.dpdk.org/patch/79540/ , function
> > > rte_power_pmd_mgmt_queue_enable()
> > >
> > > --
> > > Thanks,
> > > Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-08 17:15         ` Ananyev, Konstantin
@ 2020-10-09  9:11           ` Burakov, Anatoly
  2020-10-09 15:39             ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09  9:11 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
> 
> I think what this API is missing - a function to wakeup sleeping core.
> If user can/should use some system call to achieve that, then at least
> it has to be clearly documented, even better some wrapper provided.

I don't think it's possible to do that without severely overcomplicating 
the intrinsic and its usage, because AFAIK the only way to wake up a 
sleeping core would be to send some kind of interrupt to the core, or 
trigger a write to the cache-line in question.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  5:42                   ` Jerin Jacob
@ 2020-10-09  9:25                     ` Burakov, Anatoly
  2020-10-09  9:29                       ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09  9:25 UTC (permalink / raw)
  To: Jerin Jacob, Ananyev, Konstantin, David Marchand
  Cc: Thomas Monjalon, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
>>
>>>
>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>> <anatoly.burakov@intel.com> wrote:
>>>>
>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>
>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>> compiler support for these instructions.
>>>>>>>
>>>>>>> The power management instructions provide an architecture-specific
>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>> happened, and no more writes are expected.
>>>>>>>
>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>
>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>> Unfortunately they were not Cc'ed.
>>>>>
>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>
>>>>>> Also please mark the new functions as experimental.
>>>>>>
>>>>>>
>>>>
>>>> Hi Jerin,
>>>
>>> Hi Anatoly,
>>>
>>>>
>>>>   > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>   > the consumer of this API is only supported on x86. Probably as
>>>> functions[1]
>>>>   > or macro flags scheme and have a stub for the other architectures as the
>>>>   > API marked as generic ie rte_power_* not rte_x86_..
>>>>   >
>>>>   > This will help the consumer to create workers based on the
>>>> instruction features
>>>>   > which can NOT be abstracted as a generic feature across the
>>>> architectures.
>>>>
>>>> I'm not entirely sure what you mean by that.
>>>>
>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>> will add those in future revisions, but what does your proposed runtime
>>>> check accomplish that cannot currently be done with CPUID flags?
>>>
>>>
>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>
>>
>> I am agree with Jerin, that we need some generic way to
>> figure-out does platform supports power_monitor() or not.
>> Though not sure do we need to create a new feature-get framework here...
> 
> That's works too. Some means of generic probing is fine. Following
> schemed needs
> more documentation on that usage, as, it is not straight forward compare to
> feature-get framework. Also, on the other thread, we are adding the
> new instructions like
> demote cacheline etc, maybe if the user wants to KNOW if the arch
> supports it then
> the feature-get framework is good.
> If we think, there is no other usecase for generic arch feature-get
> framework then
> we can keep the below scheme else generic arch feature is better for
> more forward
> looking use cases.
> 
>> Might be just something like:
>>   rte_power_monitor(...) == -ENOTSUP
>> be enough indication for that?
>> So user can just do:
>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>          /* not supported  path */
>> }
>>
>> To check is that feature supported or not.
> 
> 

Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think 
we can safely make this intrinsic as a noop on other archs as well, as 
it's functionally identical to waking up immediately.

If we're not creating this for CLDEMOTE, we don't need it here as well. 
If we do need it for this, then we arguably need it for CLDEMOTE too.

>>
>>>>
>>>> If you look at patch 1 [1], we added CPUID flags that the user can
>>>> check, and in fact this is precisely what we do in patch 4 [2] before
>>>> enabling the UMWAIT path. We could perhaps document this better and
>>>> outline the dependency on the WAITPKG CPUID flag more explicitly, but
>>>> otherwise i don't see how what you're proposing isn't already possible
>>>> to do.
>>>>
>>>> [1] http://patches.dpdk.org/patch/79539/
>>>> [2] http://patches.dpdk.org/patch/79540/ , function
>>>> rte_power_pmd_mgmt_queue_enable()
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:25                     ` Burakov, Anatoly
@ 2020-10-09  9:29                       ` Thomas Monjalon
  2020-10-09  9:40                         ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-09  9:29 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Jerin Jacob, Ananyev, Konstantin, David Marchand, Ma, Liang J,
	dpdk-dev, Hunt, David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

09/10/2020 11:25, Burakov, Anatoly:
> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> > On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> >>
> >>>
> >>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>>
> >>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>
> >>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>> compiler support for these instructions.
> >>>>>>>
> >>>>>>> The power management instructions provide an architecture-specific
> >>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>> happened, and no more writes are expected.
> >>>>>>>
> >>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>
> >>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>> Unfortunately they were not Cc'ed.
> >>>>>
> >>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>
> >>>>>> Also please mark the new functions as experimental.
> >>>>>>
> >>>>>>
> >>>>
> >>>> Hi Jerin,
> >>>
> >>> Hi Anatoly,
> >>>
> >>>>
> >>>>   > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>   > the consumer of this API is only supported on x86. Probably as
> >>>> functions[1]
> >>>>   > or macro flags scheme and have a stub for the other architectures as the
> >>>>   > API marked as generic ie rte_power_* not rte_x86_..
> >>>>   >
> >>>>   > This will help the consumer to create workers based on the
> >>>> instruction features
> >>>>   > which can NOT be abstracted as a generic feature across the
> >>>> architectures.
> >>>>
> >>>> I'm not entirely sure what you mean by that.
> >>>>
> >>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>> will add those in future revisions, but what does your proposed runtime
> >>>> check accomplish that cannot currently be done with CPUID flags?
> >>>
> >>>
> >>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>
> >>
> >> I am agree with Jerin, that we need some generic way to
> >> figure-out does platform supports power_monitor() or not.
> >> Though not sure do we need to create a new feature-get framework here...
> > 
> > That's works too. Some means of generic probing is fine. Following
> > schemed needs
> > more documentation on that usage, as, it is not straight forward compare to
> > feature-get framework. Also, on the other thread, we are adding the
> > new instructions like
> > demote cacheline etc, maybe if the user wants to KNOW if the arch
> > supports it then
> > the feature-get framework is good.
> > If we think, there is no other usecase for generic arch feature-get
> > framework then
> > we can keep the below scheme else generic arch feature is better for
> > more forward
> > looking use cases.
> > 
> >> Might be just something like:
> >>   rte_power_monitor(...) == -ENOTSUP
> >> be enough indication for that?
> >> So user can just do:
> >> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>          /* not supported  path */
> >> }
> >>
> >> To check is that feature supported or not.
> > 
> > 
> 
> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think 
> we can safely make this intrinsic as a noop on other archs as well, as 
> it's functionally identical to waking up immediately.
> 
> If we're not creating this for CLDEMOTE, we don't need it here as well. 
> If we do need it for this, then we arguably need it for CLDEMOTE too.

Sorry I don't understand what you mean, too many "it" and "this" :)



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:29                       ` Thomas Monjalon
@ 2020-10-09  9:40                         ` Burakov, Anatoly
  2020-10-09  9:54                           ` Jerin Jacob
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09  9:40 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Jerin Jacob, Ananyev, Konstantin, David Marchand, Ma, Liang J,
	dpdk-dev, Hunt, David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> 09/10/2020 11:25, Burakov, Anatoly:
>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>> <konstantin.ananyev@intel.com> wrote:
>>>>
>>>>>
>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>
>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>
>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>> compiler support for these instructions.
>>>>>>>>>
>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>
>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>
>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>
>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>
>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> Hi Jerin,
>>>>>
>>>>> Hi Anatoly,
>>>>>
>>>>>>
>>>>>>    > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>    > the consumer of this API is only supported on x86. Probably as
>>>>>> functions[1]
>>>>>>    > or macro flags scheme and have a stub for the other architectures as the
>>>>>>    > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>    >
>>>>>>    > This will help the consumer to create workers based on the
>>>>>> instruction features
>>>>>>    > which can NOT be abstracted as a generic feature across the
>>>>>> architectures.
>>>>>>
>>>>>> I'm not entirely sure what you mean by that.
>>>>>>
>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>
>>>>>
>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>
>>>>
>>>> I am agree with Jerin, that we need some generic way to
>>>> figure-out does platform supports power_monitor() or not.
>>>> Though not sure do we need to create a new feature-get framework here...
>>>
>>> That's works too. Some means of generic probing is fine. Following
>>> schemed needs
>>> more documentation on that usage, as, it is not straight forward compare to
>>> feature-get framework. Also, on the other thread, we are adding the
>>> new instructions like
>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>> supports it then
>>> the feature-get framework is good.
>>> If we think, there is no other usecase for generic arch feature-get
>>> framework then
>>> we can keep the below scheme else generic arch feature is better for
>>> more forward
>>> looking use cases.
>>>
>>>> Might be just something like:
>>>>    rte_power_monitor(...) == -ENOTSUP
>>>> be enough indication for that?
>>>> So user can just do:
>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>           /* not supported  path */
>>>> }
>>>>
>>>> To check is that feature supported or not.
>>>
>>>
>>
>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>> we can safely make this intrinsic as a noop on other archs as well, as
>> it's functionally identical to waking up immediately.
>>
>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> 
> Sorry I don't understand what you mean, too many "it" and "this" :)
> 

Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't 
exist on other archs, this doesn't too, so it's a fairly similar 
situation. Stubbing UMWAIT with a noop is a valid approach because it's 
equivalent to sleeping and then immediately waking up (which can happen 
for a host of reasons unrelated to the code itself).

I'm not against a generic feature-get framework, i'm just pointing out 
that if this is what's preventing the merge, it should prevent the merge 
of CLDEMOTE as well, yet Jerin has acked that one and has explicitly 
stated that he's OK with leaving CLDEMOTE as a noop on other architectures.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:40                         ` Burakov, Anatoly
@ 2020-10-09  9:54                           ` Jerin Jacob
  2020-10-09 10:03                             ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09  9:54 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, Ananyev, Konstantin, David Marchand, Ma,
	Liang J, dpdk-dev, Hunt, David, Stephen Hemminger,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> > 09/10/2020 11:25, Burakov, Anatoly:
> >> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>> <konstantin.ananyev@intel.com> wrote:
> >>>>
> >>>>>
> >>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>
> >>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>
> >>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>> compiler support for these instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>
> >>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>
> >>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>
> >>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>
> >>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>> Hi Jerin,
> >>>>>
> >>>>> Hi Anatoly,
> >>>>>
> >>>>>>
> >>>>>>    > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>    > the consumer of this API is only supported on x86. Probably as
> >>>>>> functions[1]
> >>>>>>    > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>    > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>    >
> >>>>>>    > This will help the consumer to create workers based on the
> >>>>>> instruction features
> >>>>>>    > which can NOT be abstracted as a generic feature across the
> >>>>>> architectures.
> >>>>>>
> >>>>>> I'm not entirely sure what you mean by that.
> >>>>>>
> >>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>
> >>>>>
> >>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>
> >>>>
> >>>> I am agree with Jerin, that we need some generic way to
> >>>> figure-out does platform supports power_monitor() or not.
> >>>> Though not sure do we need to create a new feature-get framework here...
> >>>
> >>> That's works too. Some means of generic probing is fine. Following
> >>> schemed needs
> >>> more documentation on that usage, as, it is not straight forward compare to
> >>> feature-get framework. Also, on the other thread, we are adding the
> >>> new instructions like
> >>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>> supports it then
> >>> the feature-get framework is good.
> >>> If we think, there is no other usecase for generic arch feature-get
> >>> framework then
> >>> we can keep the below scheme else generic arch feature is better for
> >>> more forward
> >>> looking use cases.
> >>>
> >>>> Might be just something like:
> >>>>    rte_power_monitor(...) == -ENOTSUP
> >>>> be enough indication for that?
> >>>> So user can just do:
> >>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>           /* not supported  path */
> >>>> }
> >>>>
> >>>> To check is that feature supported or not.
> >>>
> >>>
> >>
> >> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >> we can safely make this intrinsic as a noop on other archs as well, as
> >> it's functionally identical to waking up immediately.
> >>
> >> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >
> > Sorry I don't understand what you mean, too many "it" and "this" :)
> >
>
> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> exist on other archs, this doesn't too, so it's a fairly similar
> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> equivalent to sleeping and then immediately waking up (which can happen
> for a host of reasons unrelated to the code itself).

If we are keeping the following return in the public API then it can not be NOP
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */

Also, we need to fix compilation issue if any with
http://patches.dpdk.org/patch/79540/
as it has direct reference to if
(!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
Either we need to add -ENOTSUP return or generic feature-get framework.


>
> I'm not against a generic feature-get framework, i'm just pointing out
> that if this is what's preventing the merge, it should prevent the merge
> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:54                           ` Jerin Jacob
@ 2020-10-09 10:03                             ` Burakov, Anatoly
  2020-10-09 10:17                               ` Thomas Monjalon
  2020-10-09 10:19                               ` Jerin Jacob
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 10:03 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Thomas Monjalon, Ananyev, Konstantin, David Marchand, Ma,
	Liang J, dpdk-dev, Hunt, David, Stephen Hemminger,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
>>
>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>
>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>
>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>
>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>
>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>
>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>
>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Jerin,
>>>>>>>
>>>>>>> Hi Anatoly,
>>>>>>>
>>>>>>>>
>>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>     > the consumer of this API is only supported on x86. Probably as
>>>>>>>> functions[1]
>>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>     >
>>>>>>>>     > This will help the consumer to create workers based on the
>>>>>>>> instruction features
>>>>>>>>     > which can NOT be abstracted as a generic feature across the
>>>>>>>> architectures.
>>>>>>>>
>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>
>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>
>>>>>>>
>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>
>>>>>>
>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>
>>>>> That's works too. Some means of generic probing is fine. Following
>>>>> schemed needs
>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>> new instructions like
>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>> supports it then
>>>>> the feature-get framework is good.
>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>> framework then
>>>>> we can keep the below scheme else generic arch feature is better for
>>>>> more forward
>>>>> looking use cases.
>>>>>
>>>>>> Might be just something like:
>>>>>>     rte_power_monitor(...) == -ENOTSUP
>>>>>> be enough indication for that?
>>>>>> So user can just do:
>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>            /* not supported  path */
>>>>>> }
>>>>>>
>>>>>> To check is that feature supported or not.
>>>>>
>>>>>
>>>>
>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>> it's functionally identical to waking up immediately.
>>>>
>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>
>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>
>>
>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>> exist on other archs, this doesn't too, so it's a fairly similar
>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>> equivalent to sleeping and then immediately waking up (which can happen
>> for a host of reasons unrelated to the code itself).
> 
> If we are keeping the following return in the public API then it can not be NOP
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> 

In the generic header, it is specified that return value is 
implementation-defined (i.e. arch-specific). I guess we could remove 
that and set return value to either 0 or -ENOTSUP if that would resolve 
the issue?

> Also, we need to fix compilation issue if any with
> http://patches.dpdk.org/patch/79540/
> as it has direct reference to if
> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> Either we need to add -ENOTSUP return or generic feature-get framework.

IIRC power library isn't compiled on anything other than x86, so this 
code wouldn't get compiled.

> 
> 
>>
>> I'm not against a generic feature-get framework, i'm just pointing out
>> that if this is what's preventing the merge, it should prevent the merge
>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>>
>> --
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:03                             ` Burakov, Anatoly
@ 2020-10-09 10:17                               ` Thomas Monjalon
  2020-10-09 10:22                                 ` Burakov, Anatoly
  2020-10-09 10:19                               ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2020-10-09 10:17 UTC (permalink / raw)
  To: Jerin Jacob, Burakov, Anatoly
  Cc: Ananyev, Konstantin, David Marchand, Ma, Liang J, dpdk-dev, Hunt,
	David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

09/10/2020 12:03, Burakov, Anatoly:
> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> >> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>
> >>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>
> >>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>
> >>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>
> >>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>
> >>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jerin,
> >>>>>>>
> >>>>>>> Hi Anatoly,
> >>>>>>>
> >>>>>>>>
> >>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>     > the consumer of this API is only supported on x86. Probably as
> >>>>>>>> functions[1]
> >>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>     >
> >>>>>>>>     > This will help the consumer to create workers based on the
> >>>>>>>> instruction features
> >>>>>>>>     > which can NOT be abstracted as a generic feature across the
> >>>>>>>> architectures.
> >>>>>>>>
> >>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>
> >>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>
> >>>>>>>
> >>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>
> >>>>>>
> >>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>
> >>>>> That's works too. Some means of generic probing is fine. Following
> >>>>> schemed needs
> >>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>> new instructions like
> >>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>> supports it then
> >>>>> the feature-get framework is good.
> >>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>> framework then
> >>>>> we can keep the below scheme else generic arch feature is better for
> >>>>> more forward
> >>>>> looking use cases.
> >>>>>
> >>>>>> Might be just something like:
> >>>>>>     rte_power_monitor(...) == -ENOTSUP
> >>>>>> be enough indication for that?
> >>>>>> So user can just do:
> >>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>            /* not supported  path */
> >>>>>> }
> >>>>>>
> >>>>>> To check is that feature supported or not.
> >>>>>
> >>>>>
> >>>>
> >>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>> it's functionally identical to waking up immediately.
> >>>>
> >>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>
> >>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>
> >>
> >> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >> exist on other archs, this doesn't too, so it's a fairly similar
> >> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >> equivalent to sleeping and then immediately waking up (which can happen
> >> for a host of reasons unrelated to the code itself).
> > 
> > If we are keeping the following return in the public API then it can not be NOP
> > + * @return
> > + *   - 1 if wakeup was due to TSC timeout expiration.
> > + *   - 0 if wakeup was due to memory write or other reasons.
> > + */
> > 
> 
> In the generic header, it is specified that return value is 
> implementation-defined (i.e. arch-specific).

Obviously an API definition should *never* be "implementation-defined".


> I guess we could remove 
> that and set return value to either 0 or -ENOTSUP if that would resolve 
> the issue?
> 
> > Also, we need to fix compilation issue if any with
> > http://patches.dpdk.org/patch/79540/
> > as it has direct reference to if
> > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > Either we need to add -ENOTSUP return or generic feature-get framework.
> 
> IIRC power library isn't compiled on anything other than x86, so this 
> code wouldn't get compiled.

It is not call "power-x86", so we must assume it could work
on any architecture.


> >> I'm not against a generic feature-get framework, i'm just pointing out
> >> that if this is what's preventing the merge, it should prevent the merge
> >> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.

CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
that's why the expectations may be different.



^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:03                             ` Burakov, Anatoly
  2020-10-09 10:17                               ` Thomas Monjalon
@ 2020-10-09 10:19                               ` Jerin Jacob
  1 sibling, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09 10:19 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, Ananyev, Konstantin, David Marchand, Ma,
	Liang J, dpdk-dev, Hunt, David, Stephen Hemminger,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Fri, Oct 9, 2020 at 3:33 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> >>
> >> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>
> >>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>
> >>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>
> >>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>
> >>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>
> >>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>
> >>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jerin,
> >>>>>>>
> >>>>>>> Hi Anatoly,
> >>>>>>>
> >>>>>>>>
> >>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>     > the consumer of this API is only supported on x86. Probably as
> >>>>>>>> functions[1]
> >>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>     >
> >>>>>>>>     > This will help the consumer to create workers based on the
> >>>>>>>> instruction features
> >>>>>>>>     > which can NOT be abstracted as a generic feature across the
> >>>>>>>> architectures.
> >>>>>>>>
> >>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>
> >>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>
> >>>>>>>
> >>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>
> >>>>>>
> >>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>
> >>>>> That's works too. Some means of generic probing is fine. Following
> >>>>> schemed needs
> >>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>> new instructions like
> >>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>> supports it then
> >>>>> the feature-get framework is good.
> >>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>> framework then
> >>>>> we can keep the below scheme else generic arch feature is better for
> >>>>> more forward
> >>>>> looking use cases.
> >>>>>
> >>>>>> Might be just something like:
> >>>>>>     rte_power_monitor(...) == -ENOTSUP
> >>>>>> be enough indication for that?
> >>>>>> So user can just do:
> >>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>            /* not supported  path */
> >>>>>> }
> >>>>>>
> >>>>>> To check is that feature supported or not.
> >>>>>
> >>>>>
> >>>>
> >>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>> it's functionally identical to waking up immediately.
> >>>>
> >>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>
> >>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>
> >>
> >> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >> exist on other archs, this doesn't too, so it's a fairly similar
> >> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >> equivalent to sleeping and then immediately waking up (which can happen
> >> for a host of reasons unrelated to the code itself).
> >
> > If we are keeping the following return in the public API then it can not be NOP
> > + * @return
> > + *   - 1 if wakeup was due to TSC timeout expiration.
> > + *   - 0 if wakeup was due to memory write or other reasons.
> > + */
> >
>
> In the generic header, it is specified that return value is
> implementation-defined (i.e. arch-specific). I guess we could remove
> that and set return value to either 0 or -ENOTSUP if that would resolve
> the issue?

returning -ENOTSUP should be OK if we don't  to take generic
feature-get framework  path.

>
> > Also, we need to fix compilation issue if any with
> > http://patches.dpdk.org/patch/79540/
> > as it has direct reference to if
> > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > Either we need to add -ENOTSUP return or generic feature-get framework.
>
> IIRC power library isn't compiled on anything other than x86, so this
> code wouldn't get compiled.

Just checked now, librte_power compiles at least for arm64.


>
> >
> >
> >>
> >> I'm not against a generic feature-get framework, i'm just pointing out
> >> that if this is what's preventing the merge, it should prevent the merge
> >> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >>
> >> --
> >> Thanks,
> >> Anatoly
>
>
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:17                               ` Thomas Monjalon
@ 2020-10-09 10:22                                 ` Burakov, Anatoly
  2020-10-09 10:45                                   ` Jerin Jacob
  2020-10-09 10:48                                   ` Ananyev, Konstantin
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 10:22 UTC (permalink / raw)
  To: Thomas Monjalon, Jerin Jacob
  Cc: Ananyev, Konstantin, David Marchand, Ma, Liang J, dpdk-dev, Hunt,
	David, Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> 09/10/2020 12:03, Burakov, Anatoly:
>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>> <anatoly.burakov@intel.com> wrote:
>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>
>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>
>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>
>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Jerin,
>>>>>>>>>
>>>>>>>>> Hi Anatoly,
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>> functions[1]
>>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>      >
>>>>>>>>>>      > This will help the consumer to create workers based on the
>>>>>>>>>> instruction features
>>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
>>>>>>>>>> architectures.
>>>>>>>>>>
>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>
>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>
>>>>>>>>
>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>
>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>> schemed needs
>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>> new instructions like
>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>> supports it then
>>>>>>> the feature-get framework is good.
>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>> framework then
>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>> more forward
>>>>>>> looking use cases.
>>>>>>>
>>>>>>>> Might be just something like:
>>>>>>>>      rte_power_monitor(...) == -ENOTSUP
>>>>>>>> be enough indication for that?
>>>>>>>> So user can just do:
>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>             /* not supported  path */
>>>>>>>> }
>>>>>>>>
>>>>>>>> To check is that feature supported or not.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>> it's functionally identical to waking up immediately.
>>>>>>
>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>
>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>
>>>>
>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>> for a host of reasons unrelated to the code itself).
>>>
>>> If we are keeping the following return in the public API then it can not be NOP
>>> + * @return
>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>> + */
>>>
>>
>> In the generic header, it is specified that return value is
>> implementation-defined (i.e. arch-specific).
> 
> Obviously an API definition should *never* be "implementation-defined".

If there isn't a meaningful return value, we could either make it a 
void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid 
result for a UMWAIT, and there are no side-effects to the intrinsic 
itself (it's basically a fancy rte_pause).

> 
> 
>> I guess we could remove
>> that and set return value to either 0 or -ENOTSUP if that would resolve
>> the issue?
>>
>>> Also, we need to fix compilation issue if any with
>>> http://patches.dpdk.org/patch/79540/
>>> as it has direct reference to if
>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>
>> IIRC power library isn't compiled on anything other than x86, so this
>> code wouldn't get compiled.
> 
> It is not call "power-x86", so we must assume it could work
> on any architecture.

#ifdef it is!

> 
> 
>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>> that if this is what's preventing the merge, it should prevent the merge
>>>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
>>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> 
> CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> that's why the expectations may be different.
> 

UMWAIT is a best-effort mechanism with no side-effects. It's perfectly 
legal for a UMWAIT to not sleep at all, thus rendering it effectively a 
noop. So i don't think it's all that different.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:22                                 ` Burakov, Anatoly
@ 2020-10-09 10:45                                   ` Jerin Jacob
  2020-10-09 10:48                                   ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09 10:45 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Thomas Monjalon, Ananyev, Konstantin, David Marchand, Ma,
	Liang J, dpdk-dev, Hunt, David, Stephen Hemminger,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Fri, Oct 9, 2020 at 3:53 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > 09/10/2020 12:03, Burakov, Anatoly:
> >> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> >>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>>>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>>>
> >>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>>>
> >>>>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Jerin,
> >>>>>>>>>
> >>>>>>>>> Hi Anatoly,
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
> >>>>>>>>>> functions[1]
> >>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>>>      >
> >>>>>>>>>>      > This will help the consumer to create workers based on the
> >>>>>>>>>> instruction features
> >>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
> >>>>>>>>>> architectures.
> >>>>>>>>>>
> >>>>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>>>
> >>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>>>
> >>>>>>> That's works too. Some means of generic probing is fine. Following
> >>>>>>> schemed needs
> >>>>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>>>> new instructions like
> >>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>>>> supports it then
> >>>>>>> the feature-get framework is good.
> >>>>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>>>> framework then
> >>>>>>> we can keep the below scheme else generic arch feature is better for
> >>>>>>> more forward
> >>>>>>> looking use cases.
> >>>>>>>
> >>>>>>>> Might be just something like:
> >>>>>>>>      rte_power_monitor(...) == -ENOTSUP
> >>>>>>>> be enough indication for that?
> >>>>>>>> So user can just do:
> >>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>>>             /* not supported  path */
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> To check is that feature supported or not.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>>>> it's functionally identical to waking up immediately.
> >>>>>>
> >>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>>>
> >>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>>>
> >>>>
> >>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >>>> exist on other archs, this doesn't too, so it's a fairly similar
> >>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >>>> equivalent to sleeping and then immediately waking up (which can happen
> >>>> for a host of reasons unrelated to the code itself).
> >>>
> >>> If we are keeping the following return in the public API then it can not be NOP
> >>> + * @return
> >>> + *   - 1 if wakeup was due to TSC timeout expiration.
> >>> + *   - 0 if wakeup was due to memory write or other reasons.
> >>> + */
> >>>
> >>
> >> In the generic header, it is specified that return value is
> >> implementation-defined (i.e. arch-specific).
> >
> > Obviously an API definition should *never* be "implementation-defined".
>
> If there isn't a meaningful return value, we could either make it a
> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> result for a UMWAIT, and there are no side-effects to the intrinsic
> itself (it's basically a fancy rte_pause).
>
> >
> >
> >> I guess we could remove
> >> that and set return value to either 0 or -ENOTSUP if that would resolve
> >> the issue?
> >>
> >>> Also, we need to fix compilation issue if any with
> >>> http://patches.dpdk.org/patch/79540/
> >>> as it has direct reference to if
> >>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> >>> Either we need to add -ENOTSUP return or generic feature-get framework.
> >>
> >> IIRC power library isn't compiled on anything other than x86, so this
> >> code wouldn't get compiled.
> >
> > It is not call "power-x86", so we must assume it could work
> > on any architecture.
>
> #ifdef it is!
>
> >
> >
> >>>> I'm not against a generic feature-get framework, i'm just pointing out
> >>>> that if this is what's preventing the merge, it should prevent the merge
> >>>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >
> > CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> > that's why the expectations may be different.
> >
>
> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
> noop. So i don't think it's all that different.

If a platform does not support UMWAIT in ALL case IMO, no consumer takes this
the path for power saving. So IMO, t is different than CLDEMOTE

>
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:22                                 ` Burakov, Anatoly
  2020-10-09 10:45                                   ` Jerin Jacob
@ 2020-10-09 10:48                                   ` Ananyev, Konstantin
  2020-10-09 11:12                                     ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 10:48 UTC (permalink / raw)
  To: Burakov, Anatoly, Thomas Monjalon, Jerin Jacob
  Cc: David Marchand, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > 09/10/2020 12:03, Burakov, Anatoly:
> >> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> >>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>>>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>>>
> >>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>>>
> >>>>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Jerin,
> >>>>>>>>>
> >>>>>>>>> Hi Anatoly,
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
> >>>>>>>>>> functions[1]
> >>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>>>      >
> >>>>>>>>>>      > This will help the consumer to create workers based on the
> >>>>>>>>>> instruction features
> >>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
> >>>>>>>>>> architectures.
> >>>>>>>>>>
> >>>>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>>>
> >>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>>>
> >>>>>>> That's works too. Some means of generic probing is fine. Following
> >>>>>>> schemed needs
> >>>>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>>>> new instructions like
> >>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>>>> supports it then
> >>>>>>> the feature-get framework is good.
> >>>>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>>>> framework then
> >>>>>>> we can keep the below scheme else generic arch feature is better for
> >>>>>>> more forward
> >>>>>>> looking use cases.
> >>>>>>>
> >>>>>>>> Might be just something like:
> >>>>>>>>      rte_power_monitor(...) == -ENOTSUP
> >>>>>>>> be enough indication for that?
> >>>>>>>> So user can just do:
> >>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>>>             /* not supported  path */
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> To check is that feature supported or not.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>>>> it's functionally identical to waking up immediately.
> >>>>>>
> >>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>>>
> >>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>>>
> >>>>
> >>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >>>> exist on other archs, this doesn't too, so it's a fairly similar
> >>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >>>> equivalent to sleeping and then immediately waking up (which can happen
> >>>> for a host of reasons unrelated to the code itself).
> >>>
> >>> If we are keeping the following return in the public API then it can not be NOP
> >>> + * @return
> >>> + *   - 1 if wakeup was due to TSC timeout expiration.
> >>> + *   - 0 if wakeup was due to memory write or other reasons.
> >>> + */
> >>>
> >>
> >> In the generic header, it is specified that return value is
> >> implementation-defined (i.e. arch-specific).
> >
> > Obviously an API definition should *never* be "implementation-defined".
> 
> If there isn't a meaningful return value, we could either make it a
> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> result for a UMWAIT, and there are no side-effects to the intrinsic
> itself (it's basically a fancy rte_pause).
> 
> >
> >
> >> I guess we could remove
> >> that and set return value to either 0 or -ENOTSUP if that would resolve
> >> the issue?
> >>
> >>> Also, we need to fix compilation issue if any with
> >>> http://patches.dpdk.org/patch/79540/
> >>> as it has direct reference to if
> >>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> >>> Either we need to add -ENOTSUP return or generic feature-get framework.
> >>
> >> IIRC power library isn't compiled on anything other than x86, so this
> >> code wouldn't get compiled.
> >
> > It is not call "power-x86", so we must assume it could work
> > on any architecture.
> 
> #ifdef it is!
> 
> >
> >
> >>>> I'm not against a generic feature-get framework, i'm just pointing out
> >>>> that if this is what's preventing the merge, it should prevent the merge
> >>>> of CLDEMOTE as well

I wouldn't consider these two as totally equal.
Yes, both are just hints to CPU, that can be ignored,
but if not, then consequences of executing are quite different.
If UMWAIT is not supported by cpu at all, then user might want to use some  
different power saving mechanism (pause, frequence scaling, etc.).
Without information is UMWAIT supported or not, user can't make
such choice.  

>, yet Jerin has acked that one and has explicitly
> >>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >
> > CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> > that's why the expectations may be different.
> >
> 
> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
> noop. So i don't think it's all that different.





^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 10:48                                   ` Ananyev, Konstantin
@ 2020-10-09 11:12                                     ` Burakov, Anatoly
  2020-10-09 11:36                                       ` Bruce Richardson
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 11:12 UTC (permalink / raw)
  To: Ananyev, Konstantin, Thomas Monjalon, Jerin Jacob
  Cc: David Marchand, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
>> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
>>> 09/10/2020 12:03, Burakov, Anatoly:
>>>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Jerin,
>>>>>>>>>>>
>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>       > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>>>       > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>>>> functions[1]
>>>>>>>>>>>>       > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>>>       > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>>>       >
>>>>>>>>>>>>       > This will help the consumer to create workers based on the
>>>>>>>>>>>> instruction features
>>>>>>>>>>>>       > which can NOT be abstracted as a generic feature across the
>>>>>>>>>>>> architectures.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>>>
>>>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>>>
>>>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>>>> schemed needs
>>>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>>>> new instructions like
>>>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>>>> supports it then
>>>>>>>>> the feature-get framework is good.
>>>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>>>> framework then
>>>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>>>> more forward
>>>>>>>>> looking use cases.
>>>>>>>>>
>>>>>>>>>> Might be just something like:
>>>>>>>>>>       rte_power_monitor(...) == -ENOTSUP
>>>>>>>>>> be enough indication for that?
>>>>>>>>>> So user can just do:
>>>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>>>              /* not supported  path */
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> To check is that feature supported or not.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>>>> it's functionally identical to waking up immediately.
>>>>>>>>
>>>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>>>
>>>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>>>
>>>>>>
>>>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>>>> for a host of reasons unrelated to the code itself).
>>>>>
>>>>> If we are keeping the following return in the public API then it can not be NOP
>>>>> + * @return
>>>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>>>> + */
>>>>>
>>>>
>>>> In the generic header, it is specified that return value is
>>>> implementation-defined (i.e. arch-specific).
>>>
>>> Obviously an API definition should *never* be "implementation-defined".
>>
>> If there isn't a meaningful return value, we could either make it a
>> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
>> result for a UMWAIT, and there are no side-effects to the intrinsic
>> itself (it's basically a fancy rte_pause).
>>
>>>
>>>
>>>> I guess we could remove
>>>> that and set return value to either 0 or -ENOTSUP if that would resolve
>>>> the issue?
>>>>
>>>>> Also, we need to fix compilation issue if any with
>>>>> http://patches.dpdk.org/patch/79540/
>>>>> as it has direct reference to if
>>>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>>>
>>>> IIRC power library isn't compiled on anything other than x86, so this
>>>> code wouldn't get compiled.
>>>
>>> It is not call "power-x86", so we must assume it could work
>>> on any architecture.
>>
>> #ifdef it is!
>>
>>>
>>>
>>>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>>>> that if this is what's preventing the merge, it should prevent the merge
>>>>>> of CLDEMOTE as well
> 
> I wouldn't consider these two as totally equal.
> Yes, both are just hints to CPU, that can be ignored,
> but if not, then consequences of executing are quite different.
> If UMWAIT is not supported by cpu at all, then user might want to use some
> different power saving mechanism (pause, frequence scaling, etc.).
> Without information is UMWAIT supported or not, user can't make
> such choice.

After some attempts at implementing this, i actually came to the 
conclusion that some generic way to check support for this feature is 
necessary, because we end up with a usability inconsistency:

1) on non-x86, if you call the function, it returns -ENOTSUP
2) on x86, since we're not checking CPUID flags on every single call, 
it'll either succeed, or crash the process - the burden is on the user 
to check for CPUID flags, but it can't be done in an arch agnostic way 
because the CPUID flags are only defined for x86, thus requiring a 
special code path for x86

Where would be the best place to add such an infrastructure? I'm 
thinking rte_cpuflags.h?

> 
>> , yet Jerin has acked that one and has explicitly
>>>>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>>>
>>> CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
>>> that's why the expectations may be different.
>>>
>>
>> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
>> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
>> noop. So i don't think it's all that different.
> 
> 
> 
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 11:12                                     ` Burakov, Anatoly
@ 2020-10-09 11:36                                       ` Bruce Richardson
  2020-10-09 11:42                                         ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Bruce Richardson @ 2020-10-09 11:36 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Ananyev, Konstantin, Thomas Monjalon, Jerin Jacob,
	David Marchand, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On Fri, Oct 09, 2020 at 12:12:56PM +0100, Burakov, Anatoly wrote:
> On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
> > > On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > > > 09/10/2020 12:03, Burakov, Anatoly:
> > > > > On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > > > > > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> > > > > > > > 09/10/2020 11:25, Burakov, Anatoly:
> > > > > > > > > On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> > > > > > > > > > On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> > > > > > > > > > <konstantin.ananyev@intel.com> wrote:
> > > > > > > > > > > > On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> > > > > > > > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > > > > > > > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > > > > > > > > > > > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Add two new power management intrinsics, and provide an implementation
> > > > > > > > > > > > > > > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > > > > > > > > > > > > > > are implemented as raw byte opcodes because there is not yet widespread
> > > > > > > > > > > > > > > > compiler support for these instructions.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The power management instructions provide an architecture-specific
> > > > > > > > > > > > > > > > function to either wait until a specified TSC timestamp is reached, or
> > > > > > > > > > > > > > > > optionally wait until either a TSC timestamp is reached or a memory
> > > > > > > > > > > > > > > > location is written to. The monitor function also provides an optional
> > > > > > > > > > > > > > > > comparison, to avoid sleeping when the expected write has already
> > > > > > > > > > > > > > > > happened, and no more writes are expected.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > For more details, Please reference Intel SDM Volume 2.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I really would like to see feedbacks from other arch maintainers.
> > > > > > > > > > > > > > > Unfortunately they were not Cc'ed.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > > > > > > > > > > > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Also please mark the new functions as experimental.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hi Jerin,
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Anatoly,
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > >       > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> > > > > > > > > > > > >       > the consumer of this API is only supported on x86. Probably as
> > > > > > > > > > > > > functions[1]
> > > > > > > > > > > > >       > or macro flags scheme and have a stub for the other architectures as the
> > > > > > > > > > > > >       > API marked as generic ie rte_power_* not rte_x86_..
> > > > > > > > > > > > >       >
> > > > > > > > > > > > >       > This will help the consumer to create workers based on the
> > > > > > > > > > > > > instruction features
> > > > > > > > > > > > >       > which can NOT be abstracted as a generic feature across the
> > > > > > > > > > > > > architectures.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I'm not entirely sure what you mean by that.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I mean, yes, we should have added stubs for other architectures, and we
> > > > > > > > > > > > > will add those in future revisions, but what does your proposed runtime
> > > > > > > > > > > > > check accomplish that cannot currently be done with CPUID flags?
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> > > > > > > > > > > > i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> > > > > > > > > > > > and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> > > > > > > > > > > > I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > I am agree with Jerin, that we need some generic way to
> > > > > > > > > > > figure-out does platform supports power_monitor() or not.
> > > > > > > > > > > Though not sure do we need to create a new feature-get framework here...
> > > > > > > > > > 
> > > > > > > > > > That's works too. Some means of generic probing is fine. Following
> > > > > > > > > > schemed needs
> > > > > > > > > > more documentation on that usage, as, it is not straight forward compare to
> > > > > > > > > > feature-get framework. Also, on the other thread, we are adding the
> > > > > > > > > > new instructions like
> > > > > > > > > > demote cacheline etc, maybe if the user wants to KNOW if the arch
> > > > > > > > > > supports it then
> > > > > > > > > > the feature-get framework is good.
> > > > > > > > > > If we think, there is no other usecase for generic arch feature-get
> > > > > > > > > > framework then
> > > > > > > > > > we can keep the below scheme else generic arch feature is better for
> > > > > > > > > > more forward
> > > > > > > > > > looking use cases.
> > > > > > > > > > 
> > > > > > > > > > > Might be just something like:
> > > > > > > > > > >       rte_power_monitor(...) == -ENOTSUP
> > > > > > > > > > > be enough indication for that?
> > > > > > > > > > > So user can just do:
> > > > > > > > > > > if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> > > > > > > > > > >              /* not supported  path */
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > To check is that feature supported or not.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> > > > > > > > > we can safely make this intrinsic as a noop on other archs as well, as
> > > > > > > > > it's functionally identical to waking up immediately.
> > > > > > > > > 
> > > > > > > > > If we're not creating this for CLDEMOTE, we don't need it here as well.
> > > > > > > > > If we do need it for this, then we arguably need it for CLDEMOTE too.
> > > > > > > > 
> > > > > > > > Sorry I don't understand what you mean, too many "it" and "this" :)
> > > > > > > > 
> > > > > > > 
> > > > > > > Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> > > > > > > exist on other archs, this doesn't too, so it's a fairly similar
> > > > > > > situation. Stubbing UMWAIT with a noop is a valid approach because it's
> > > > > > > equivalent to sleeping and then immediately waking up (which can happen
> > > > > > > for a host of reasons unrelated to the code itself).
> > > > > > 
> > > > > > If we are keeping the following return in the public API then it can not be NOP
> > > > > > + * @return
> > > > > > + *   - 1 if wakeup was due to TSC timeout expiration.
> > > > > > + *   - 0 if wakeup was due to memory write or other reasons.
> > > > > > + */
> > > > > > 
> > > > > 
> > > > > In the generic header, it is specified that return value is
> > > > > implementation-defined (i.e. arch-specific).
> > > > 
> > > > Obviously an API definition should *never* be "implementation-defined".
> > > 
> > > If there isn't a meaningful return value, we could either make it a
> > > void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> > > result for a UMWAIT, and there are no side-effects to the intrinsic
> > > itself (it's basically a fancy rte_pause).
> > > 
> > > > 
> > > > 
> > > > > I guess we could remove
> > > > > that and set return value to either 0 or -ENOTSUP if that would resolve
> > > > > the issue?
> > > > > 
> > > > > > Also, we need to fix compilation issue if any with
> > > > > > http://patches.dpdk.org/patch/79540/
> > > > > > as it has direct reference to if
> > > > > > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > > > > > Either we need to add -ENOTSUP return or generic feature-get framework.
> > > > > 
> > > > > IIRC power library isn't compiled on anything other than x86, so this
> > > > > code wouldn't get compiled.
> > > > 
> > > > It is not call "power-x86", so we must assume it could work
> > > > on any architecture.
> > > 
> > > #ifdef it is!
> > > 
> > > > 
> > > > 
> > > > > > > I'm not against a generic feature-get framework, i'm just pointing out
> > > > > > > that if this is what's preventing the merge, it should prevent the merge
> > > > > > > of CLDEMOTE as well
> > 
> > I wouldn't consider these two as totally equal.
> > Yes, both are just hints to CPU, that can be ignored,
> > but if not, then consequences of executing are quite different.
> > If UMWAIT is not supported by cpu at all, then user might want to use some
> > different power saving mechanism (pause, frequence scaling, etc.).
> > Without information is UMWAIT supported or not, user can't make
> > such choice.
> 
> After some attempts at implementing this, i actually came to the conclusion
> that some generic way to check support for this feature is necessary,
> because we end up with a usability inconsistency:
> 
> 1) on non-x86, if you call the function, it returns -ENOTSUP
> 2) on x86, since we're not checking CPUID flags on every single call, it'll
> either succeed, or crash the process - the burden is on the user to check
> for CPUID flags, but it can't be done in an arch agnostic way because the
> CPUID flags are only defined for x86, thus requiring a special code path for
> x86
> 
> Where would be the best place to add such an infrastructure? I'm thinking
> rte_cpuflags.h?
> 
Time to relook at some of the contents of patchset:
http://patches.dpdk.org/project/dpdk/list/?series=4811&archive=both&state=*

The idea of that set (IIRC) was to replace the per-architecture enums with
just strings to avoid situations like this - or at least make them less
awkward.

/Bruce

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 11:36                                       ` Bruce Richardson
@ 2020-10-09 11:42                                         ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 11:42 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Ananyev, Konstantin, Thomas Monjalon, Jerin Jacob,
	David Marchand, Ma, Liang J, dpdk-dev, Hunt, David,
	Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	David Christensen, Jerin Jacob

On 09-Oct-20 12:36 PM, Bruce Richardson wrote:
> On Fri, Oct 09, 2020 at 12:12:56PM +0100, Burakov, Anatoly wrote:
>> On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
>>>> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
>>>>> 09/10/2020 12:03, Burakov, Anatoly:
>>>>>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>>>>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Jerin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>>>>>        > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>>>>>> functions[1]
>>>>>>>>>>>>>>        > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>>>>>        > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>>>>>        >
>>>>>>>>>>>>>>        > This will help the consumer to create workers based on the
>>>>>>>>>>>>>> instruction features
>>>>>>>>>>>>>>        > which can NOT be abstracted as a generic feature across the
>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>>>>>
>>>>>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>>>>>> schemed needs
>>>>>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>>>>>> new instructions like
>>>>>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>>>>>> supports it then
>>>>>>>>>>> the feature-get framework is good.
>>>>>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>>>>>> framework then
>>>>>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>>>>>> more forward
>>>>>>>>>>> looking use cases.
>>>>>>>>>>>
>>>>>>>>>>>> Might be just something like:
>>>>>>>>>>>>        rte_power_monitor(...) == -ENOTSUP
>>>>>>>>>>>> be enough indication for that?
>>>>>>>>>>>> So user can just do:
>>>>>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>>>>>               /* not supported  path */
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> To check is that feature supported or not.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>>>>>> it's functionally identical to waking up immediately.
>>>>>>>>>>
>>>>>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>>>>>
>>>>>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>>>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>>>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>>>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>>>>>> for a host of reasons unrelated to the code itself).
>>>>>>>
>>>>>>> If we are keeping the following return in the public API then it can not be NOP
>>>>>>> + * @return
>>>>>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>>>>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>>>>>> + */
>>>>>>>
>>>>>>
>>>>>> In the generic header, it is specified that return value is
>>>>>> implementation-defined (i.e. arch-specific).
>>>>>
>>>>> Obviously an API definition should *never* be "implementation-defined".
>>>>
>>>> If there isn't a meaningful return value, we could either make it a
>>>> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
>>>> result for a UMWAIT, and there are no side-effects to the intrinsic
>>>> itself (it's basically a fancy rte_pause).
>>>>
>>>>>
>>>>>
>>>>>> I guess we could remove
>>>>>> that and set return value to either 0 or -ENOTSUP if that would resolve
>>>>>> the issue?
>>>>>>
>>>>>>> Also, we need to fix compilation issue if any with
>>>>>>> http://patches.dpdk.org/patch/79540/
>>>>>>> as it has direct reference to if
>>>>>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>>>>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>>>>>
>>>>>> IIRC power library isn't compiled on anything other than x86, so this
>>>>>> code wouldn't get compiled.
>>>>>
>>>>> It is not call "power-x86", so we must assume it could work
>>>>> on any architecture.
>>>>
>>>> #ifdef it is!
>>>>
>>>>>
>>>>>
>>>>>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>>>>>> that if this is what's preventing the merge, it should prevent the merge
>>>>>>>> of CLDEMOTE as well
>>>
>>> I wouldn't consider these two as totally equal.
>>> Yes, both are just hints to CPU, that can be ignored,
>>> but if not, then consequences of executing are quite different.
>>> If UMWAIT is not supported by cpu at all, then user might want to use some
>>> different power saving mechanism (pause, frequence scaling, etc.).
>>> Without information is UMWAIT supported or not, user can't make
>>> such choice.
>>
>> After some attempts at implementing this, i actually came to the conclusion
>> that some generic way to check support for this feature is necessary,
>> because we end up with a usability inconsistency:
>>
>> 1) on non-x86, if you call the function, it returns -ENOTSUP
>> 2) on x86, since we're not checking CPUID flags on every single call, it'll
>> either succeed, or crash the process - the burden is on the user to check
>> for CPUID flags, but it can't be done in an arch agnostic way because the
>> CPUID flags are only defined for x86, thus requiring a special code path for
>> x86
>>
>> Where would be the best place to add such an infrastructure? I'm thinking
>> rte_cpuflags.h?
>>
> Time to relook at some of the contents of patchset:
> http://patches.dpdk.org/project/dpdk/list/?series=4811&archive=both&state=*
> 
> The idea of that set (IIRC) was to replace the per-architecture enums with
> just strings to avoid situations like this - or at least make them less
> awkward.
> 
> /Bruce
> 

Yes, that patchset looks like it would work nicely in this case. If 
there is consensus to resurrect it and make it work, i'll drop the 
"generic feature get" thing, but for now i'll continue working on it - 
easier to remove code than to add it :D

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09  9:11           ` Burakov, Anatoly
@ 2020-10-09 15:39             ` Ananyev, Konstantin
  2020-10-09 16:10               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 15:39 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen


> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
> >>
> >> Add two new power management intrinsics, and provide an implementation
> >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >> are implemented as raw byte opcodes because there is not yet widespread
> >> compiler support for these instructions.
> >>
> >> The power management instructions provide an architecture-specific
> >> function to either wait until a specified TSC timestamp is reached, or
> >> optionally wait until either a TSC timestamp is reached or a memory
> >> location is written to. The monitor function also provides an optional
> >> comparison, to avoid sleeping when the expected write has already
> >> happened, and no more writes are expected.
> >
> > I think what this API is missing - a function to wakeup sleeping core.
> > If user can/should use some system call to achieve that, then at least
> > it has to be clearly documented, even better some wrapper provided.
> 
> I don't think it's possible to do that without severely overcomplicating
> the intrinsic and its usage, because AFAIK the only way to wake up a
> sleeping core would be to send some kind of interrupt to the core, or
> trigger a write to the cache-line in question.
> 

Yes, I think we either need a syscall that would do an IPI for us
(on top of my head - membarrier() does that, might be there are some other syscalls too),
or something hand-made. For hand-made, I wonder would something like that
be safe and sufficient:
uint64_t val = atomic_load(addr);
CAS(addr, val, &val);
?
Anyway, one way or another - I think ability to wakeup core we put to sleep
have to be an essential part of this feature. 
As I understand linux kernel will limit max amount of sleep time for these instructions:
https://lwn.net/Articles/790920/
But relying just on that, seems too vague for me:
- user can adjust that value
- wouldn't apply to older kernels and non-linux cases
Konstantin






^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 05/10] net/ixgbe: implement power management API Liang Ma
@ 2020-10-09 15:53         ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 15:53 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 06/10] net/i40e: implement power management API
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 06/10] net/i40e: " Liang Ma
@ 2020-10-09 16:01         ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 16:01 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1


^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (10 preceding siblings ...)
  2020-10-08 22:08       ` Ananyev, Konstantin
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 17:06         ` Burakov, Anatoly
                           ` (10 more replies)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
                         ` (8 subsequent siblings)
  20 siblings, 11 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Bruce Richardson, Konstantin Ananyev, david.hunt,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add new x86 cpuid support for WAITPKG.
This flag indicate processor support umwait/umonitor/tpause
instruction.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
 lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/**< UMWAIT/TPAUSE Instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (11 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 16:09         ` Jerin Jacob
  2020-10-12 19:47         ` David Christensen
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
                         ` (7 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jan Viktorin, Ruifeng Wang, David Christensen,
	Bruce Richardson, Konstantin Ananyev, david.hunt, jerinjacobk,
	thomas, timothy.mcdaniel, gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

For more details, please refer to Intel(R) 64 and IA-32 Architectures
Software Developer's Manual, Volume 2.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Removed return values
    - Simplified intrinsics and hardcoded C0.2 state
    - Added other arch stubs

 lib/librte_eal/arm/include/meson.build        |   1 +
 .../arm/include/rte_power_intrinsics.h        |  62 ++++++++++
 .../include/generic/rte_power_intrinsics.h    |  61 ++++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/ppc/include/meson.build        |   1 +
 .../ppc/include/rte_power_intrinsics.h        |  62 ++++++++++
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 106 ++++++++++++++++++
 8 files changed, 295 insertions(+)
 create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
index 73b750a18f..c6a9f70d73 100644
--- a/lib/librte_eal/arm/include/meson.build
+++ b/lib/librte_eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
 	'rte_pause_32.h',
 	'rte_pause_64.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch_32.h',
 	'rte_prefetch_64.h',
 	'rte_prefetch.h',
diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..4aad44a0b9
--- /dev/null
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_ARM_H_
+#define _RTE_POWER_INTRINSIC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on ARM.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 0 on success
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+}
+
+/**
+ * This function is not supported on ARM.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..e36c1f8976
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file define APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   - 0 on success
+ *   - -ENOTSUP if not supported
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
index ab4bd28092..0873b2aecb 100644
--- a/lib/librte_eal/ppc/include/meson.build
+++ b/lib/librte_eal/ppc/include/meson.build
@@ -10,6 +10,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_pause.h',
+	'rte_power_intrinsics.h',
 	'rte_prefetch.h',
 	'rte_rwlock.h',
 	'rte_spinlock.h',
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..70fd7b094f
--- /dev/null
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_PPC_H_
+#define _RTE_POWER_INTRINSIC_PPC_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * This function is not supported on PPC64.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 0 on success
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(p);
+	RTE_SET_USED(expected_value);
+	RTE_SET_USED(value_mask);
+	RTE_SET_USED(tsc_timestamp);
+}
+
+/**
+ * This function is not supported on PPC64.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(tsc_timestamp);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..8d579eaf64
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
+ * For more information about usage of these instructions, please refer to
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 0 on success
+ */
+static inline void rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
+		: /* ignore rflags */
+		: "D"(0), /* enter C0.2 */
+		  "a"(tsc_l), "d"(tsc_h));
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction  and will enter C0.2 state. For more
+ * information about usage of this instruction, please refer to Intel(R) 64 and
+ * IA-32 Architectures Software Developer's Manual.
+ *
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline void rte_power_pause(const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
+		     : /* ignore rflags */
+		     : "D"(0), /* enter C0.2 */
+		       "a"(tsc_l), "d"(tsc_h));
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (12 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-11 10:07         ` Jerin Jacob
  2020-10-12 19:52         ` David Christensen
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API Anatoly Burakov
                         ` (6 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Ray Kinsella,
	Neil Horman, Bruce Richardson, Konstantin Ananyev, david.hunt,
	liang.j.ma, jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

Currently, it is not possible to check support for intrinsics that
are platform-specific, cannot be abstracted in a generic way, or do not
have support on all architectures. The CPUID flags can be used to some
extent, but they are only defined for their platform, while intrinsics
will be available to all code as they are in generic headers.

This patch introduces infrastructure to check support for certain
platform-specific intrinsics, and adds support for checking support for
IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../arm/include/rte_power_intrinsics.h        |  8 ++++++
 lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
 .../include/generic/rte_power_intrinsics.h    |  8 ++++++
 .../ppc/include/rte_power_intrinsics.h        |  8 ++++++
 lib/librte_eal/ppc/rte_cpuflags.c             |  6 +++++
 lib/librte_eal/rte_eal_version.map            |  1 +
 .../x86/include/rte_power_intrinsics.h        |  8 ++++++
 lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
 9 files changed, 83 insertions(+)

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 4aad44a0b9..055ec5877a 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -17,6 +17,10 @@ extern "C" {
 /**
  * This function is not supported on ARM.
  *
+ * @warning It is responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
+ *   so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
  * @param expected_value
@@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
 /**
  * This function is not supported on ARM.
  *
+ * @warning It is responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
+ *   so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for.
  *
diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
index caf3dc83a5..7eef11fa02 100644
--- a/lib/librte_eal/arm/rte_cpuflags.c
+++ b/lib/librte_eal/arm/rte_cpuflags.c
@@ -138,3 +138,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
index 872f0ebe3e..28a5aecde8 100644
--- a/lib/librte_eal/include/generic/rte_cpuflags.h
+++ b/lib/librte_eal/include/generic/rte_cpuflags.h
@@ -13,6 +13,32 @@
 #include "rte_common.h"
 #include <errno.h>
 
+#include <rte_compat.h>
+
+/**
+ * Structure used to describe platform-specific intrinsics that may or may not
+ * be supported at runtime.
+ */
+struct rte_cpu_intrinsics {
+	uint32_t power_monitor : 1;
+	/**< indicates support for rte_power_monitor function */
+	uint32_t power_pause : 1;
+	/**< indicates support for rte_power_pause function */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Check CPU support for various intrinsics at runtime.
+ *
+ * @param intrinsics
+ *     Pointer to a structure to be filled.
+ */
+__rte_experimental
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
+
 /**
  * Enumeration of all CPU features supported
  */
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index e36c1f8976..218eda7e86 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -26,6 +26,10 @@
  * checked against the expected value, and if they match, the entering of
  * optimized power state may be aborted.
  *
+ * @warning It is responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
+ *   so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
  * @param expected_value
@@ -49,6 +53,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * Enter an architecture-defined optimized power state until a certain TSC
  * timestamp is reached.
  *
+ * @warning It is responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
+ *   so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for. Note that the wait behavior is
  *   architecture-dependent.
diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
index 70fd7b094f..d63ad86849 100644
--- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
@@ -17,6 +17,10 @@ extern "C" {
 /**
  * This function is not supported on PPC64.
  *
+ * @warning It is responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
+ *   so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
  * @param expected_value
@@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
 /**
  * This function is not supported on PPC64.
  *
+ * @warning It is responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
+ *   so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for.
  *
diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
index 3bb7563ce9..eee8234384 100644
--- a/lib/librte_eal/ppc/rte_cpuflags.c
+++ b/lib/librte_eal/ppc/rte_cpuflags.c
@@ -108,3 +108,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+}
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index a93dea9fe6..ed944f2bd4 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -400,6 +400,7 @@ EXPERIMENTAL {
 	# added in 20.11
 	__rte_eal_trace_generic_size_t;
 	rte_service_lcore_may_be_active;
+	rte_cpu_get_intrinsics_support;
 };
 
 INTERNAL {
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
index 8d579eaf64..3afc165a1f 100644
--- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -29,6 +29,10 @@ extern "C" {
  * For more information about usage of these instructions, please refer to
  * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
  *
+ * @warning It is responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
+ *   so may result in an illegal CPU instruction error.
+ *
  * @param p
  *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
  * @param expected_value
@@ -80,6 +84,10 @@ static inline void rte_power_monitor(const volatile void *p,
  * information about usage of this instruction, please refer to Intel(R) 64 and
  * IA-32 Architectures Software Developer's Manual.
  *
+ * @warning It is responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
+ *   so may result in an illegal CPU instruction error.
+ *
  * @param tsc_timestamp
  *   Maximum TSC timestamp to wait for.
  *
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 0325c4b93b..a96312ff7f 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -7,6 +7,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <stdint.h>
+#include <string.h>
 
 #include "rte_cpuid.h"
 
@@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
 		return NULL;
 	return rte_cpu_feature_table[feature].name;
 }
+
+void
+rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
+{
+	memset(intrinsics, 0, sizeof(*intrinsics));
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
+		intrinsics->power_monitor = 1;
+		intrinsics->power_pause = 1;
+	}
+}
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (13 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-14  3:10         ` Guo, Jia
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 05/10] power: add PMD power management API and callback Anatoly Burakov
                         ` (5 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple API to allow getting address of next RX descriptor from the
PMD, as well as release notes information.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Bring function format in line with other functions in the file
    - Ensure the API is supported by the driver before calling it (Konstantin)

 doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
 lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_version.map |  1 +
 5 files changed, 86 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 808bdc4e54..e85af5d3e9 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **ethdev: add 1 new EXPERIMENTAL API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()``
+  * add new eth_dev_ops ``get_wake_addr``
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
@@ -136,6 +141,17 @@ New Features
   * Extern objects and functions can be plugged into the pipeline.
   * Transaction-oriented table updates.
 
+* **Add PMD power management mechanism**
+
+  3 new Ethernet PMD power management mechanism is added through existing
+  RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 48d1333b17..352108f43c 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -4804,6 +4804,23 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d2bf74f128..a6cfe3cd57 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4014,6 +4014,30 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * Retrieve the wake up address from specific queue
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Tx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param wake_addr
+ *   The pointer point to the address which is used for monitoring.
+ * @param expected
+ *   The pointer point to value to be expected when descriptor is set.
+ * @param mask
+ *   The pointer point to comparison bitmask for the expected value.
+ *
+ * @return
+ *   - 0: Success.
+ *   -EINVAL: Failed to get wake address.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+			  volatile void **wake_addr,
+			  uint64_t *expected, uint64_t *mask);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c3062c246c..935d46f25c 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the Wake up address.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   The pointer point to descriptor address var.
+ * @param expected
+ *   The pointer point to value to be expected when descriptor is set.
+ * @param mask
+ *   The pointer point to comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_get_wake_addr_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -713,6 +738,9 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get wake up address. */
+
 };
 
 /**
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index c95ef5157a..3cb2093980 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -229,6 +229,7 @@ EXPERIMENTAL {
 	# added in 20.11
 	rte_eth_link_speed_to_str;
 	rte_eth_link_to_str;
+	rte_eth_get_wake_addr;
 };
 
 INTERNAL {
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 05/10] power: add PMD power management API and callback
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (14 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
                         ` (4 subsequent siblings)
  20 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman,
	konstantin.ananyev, jerinjacobk, bruce.richardson, thomas,
	timothy.mcdaniel, gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will either wait
until a TSC timestamp expires, or when packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queued per device are supported, but they have to be polled on different
cores).

This design is using PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on an address of next RX
   descriptor to be written to.

2. Pause instruction

   Instead of move the core into deeper C state, this method uses the
   pause instruction to avoid busy polling.

3. Frequency scaling
   Reuse existing DPDK power library to scale up/down core frequency
   depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
      - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check

 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/pmd_mgmt.h            |  38 ++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 244 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++
 lib/librte_power/rte_power_version.map |   4 +
 5 files changed, 377 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/pmd_mgmt.h
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
new file mode 100644
index 0000000000..20be53bacf
--- /dev/null
+++ b/lib/librte_power/pmd_mgmt.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _PMD_MGMT_H
+#define _PMD_MGMT_H
+
+/**
+ * @file
+ * Power Management
+ */
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< Power mgmt Callback mode */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Empty poll number */
+	uint16_t empty_poll_stats;
+	/**< Callback instance  */
+	const struct rte_eth_rxtx_callback *cur_cb;
+} __rte_cache_aligned;
+
+struct pmd_port_cfg {
+	int  ref_cnt;
+	struct pmd_queue_cfg *queue_cfg;
+} __rte_cache_aligned;
+
+#endif
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..07dfe7c077
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,244 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+#include "pmd_mgmt.h"
+
+
+#define EMPTYPOLL_MAX  512
+#define PAUSE_NUM  64
+
+static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			uint16_t ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = rte_eth_get_wake_addr(port_id, qidx,
+					&target_addr, &expected, &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr, expected,
+						mask, -1ULL);
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+	int i;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			for (i = 0; i < PAUSE_NUM; i++)
+				rte_pause();
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+	q_conf = &port_cfg[port_id].queue_cfg[qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			/*scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scal up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret = 0;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (port_cfg[port_id].queue_cfg == NULL) {
+		port_cfg[port_id].ref_cnt = 0;
+		/* allocate memory for empty poll stats */
+		port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
+					sizeof(struct pmd_queue_cfg)
+					* RTE_MAX_QUEUES_PER_PORT,
+					0, dev->data->numa_node);
+		if (port_cfg[port_id].queue_cfg == NULL)
+			return -ENOMEM;
+	}
+
+	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto failure_handler;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		void *dummy_addr;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id, &dummy_addr,
+				&dummy_expected, &dummy_mask) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+			!rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto failure_handler;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto failure_handler;
+		}
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+	queue_cfg->cb_mode = mode;
+	port_cfg[port_id].ref_cnt++;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	return ret;
+
+failure_handler:
+	if (port_cfg[port_id].ref_cnt == 0) {
+		rte_free(port_cfg[port_id].queue_cfg);
+		port_cfg[port_id].queue_cfg = NULL;
+	}
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+
+	if (port_cfg[port_id].ref_cnt <= 0)
+		return -EINVAL;
+
+	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
+		return -EINVAL;
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+					   queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+					   queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/* it's not recommend to free callback instance here.
+	 * it cause memory leak which is a known issue.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	port_cfg[port_id].ref_cnt--;
+
+	if (port_cfg[port_id].ref_cnt == 0) {
+		rte_free(port_cfg[port_id].queue_cfg);
+		port_cfg[port_id].queue_cfg = NULL;
+	}
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..8b110f1148
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (15 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 05/10] power: add PMD power management API and callback Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-12  7:46         ` Wang, Haiyue
  2020-10-12  8:09         ` Wang, Haiyue
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 07/10] net/i40e: " Anatoly Burakov
                         ` (3 subsequent siblings)
  20 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Jeff Guo, Haiyue Wang, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 0b98e210e7..30b3f416d4 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.get_wake_addr        = ixgbe_get_wake_addr,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..7a9fd2aec6 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..75020fa2fc 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 07/10] net/i40e: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (16 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-14  3:19         ` Guo, Jia
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 08/10] net/ice: " Anatoly Burakov
                         ` (2 subsequent siblings)
  20 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Beilei Xing, Jeff Guo, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 943cfe71dc..cab86f8ec9 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.get_wake_addr	              = i40e_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 322fc1ed75..c17f27292f 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..f23a2073e3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *value);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 08/10] net/ice: implement power management API
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (17 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 07/10] net/i40e: " Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 10/10] doc: update programmer's guide for power library Anatoly Burakov
  20 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Qiming Yang, Qi Zhang, david.hunt, konstantin.ananyev,
	jerinjacobk, bruce.richardson, thomas, timothy.mcdaniel,
	gage.eads, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Implement support for the power management API by implementing a
`get_wake_addr` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d8ce09d28f..260de5dfd7 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.get_wake_addr	              = ice_get_wake_addr,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 93a0ac6918..9e55eca942 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -25,6 +25,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint8_t
 ice_proto_xtr_type_to_rxdid(uint8_t xtr_type)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 1c23c7541e..c729e474c9 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
+		      uint64_t *expected, uint64_t *mask);
 
 #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
 	int i; \
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 09/10] examples/l3fwd-power: enable PMD power mgmt
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (18 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 08/10] net/ice: " Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 10/10] doc: update programmer's guide for power library Anatoly Burakov
  20 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, konstantin.ananyev, jerinjacobk,
	bruce.richardson, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Moved doc update here
    - Some minor formatting fixes

 .../sample_app_ug/l3_forward_power_man.rst    | 13 ++++++
 examples/l3fwd-power/main.c                   | 41 ++++++++++++++++++-
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 0cc6f2e62e..8722fbaeaa 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 
 *   --telemetry:  Telemetry mode.
 
+*   --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -459,3 +461,14 @@ reference cycles and accordingly busy rate is set  to either 0% or
 
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script
 ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD power management  mode support for ``l3fwd-power`` is a standalone mode, in this mode
+``l3fwd-power`` does simple l3fwding along with enable the power saving scheme on specific
+port/queue/lcore. Main purpose for this mode is to demonstrate how to use the PMD power management API.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power --pmd-mgmt -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index d0e6c9bd77..af64dd521f 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt  mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,12 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					 "PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				rte_power_pmd_mgmt_queue_enable(lcore_id,
+							portid, queueid,
+						RTE_POWER_MGMT_TYPE_SCALE);
+
+			}
 		}
 	}
 
@@ -2790,6 +2812,9 @@ main(int argc, char **argv)
 						SKIP_MASTER);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MASTER);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+					 CALL_MASTER);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2812,6 +2837,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+							portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
 			deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* [dpdk-dev] [PATCH v5 10/10] doc: update programmer's guide for power library
  2020-10-02 14:11     ` [dpdk-dev] [PATCH v4 01/10] eal: add new x86 cpuid support for WAITPKG Liang Ma
                         ` (19 preceding siblings ...)
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 09/10] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov
@ 2020-10-09 16:02       ` Anatoly Burakov
  20 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-10-09 16:02 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, konstantin.ananyev, jerinjacobk,
	bruce.richardson, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Update programmer's guide to document PMD power management usage.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Moved l3fwd-power update to the l3fwd-power-related commit
    - Some rewordings and clarifications

 doc/guides/prog_guide/power_man.rst | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..38c64d31e4 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,45 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+  * UMWAIT/UMONITOR
+
+   This power saving scheme will put the CPU into optimized power state and use
+   the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX descriptor
+   address, and wake the CPU up whenever there's new traffic.
+
+  * Pause
+
+   This power saving scheme will use the `rte_pause` function to avoid busy
+   polling.
+
+  * Frequency scaling
+
+   This power saving scheme will use existing power library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+   Currently, this power management API is limited to mandatory mapping of 1
+   queue to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **Queue Enable**: Enable specific power scheme for certain queue/port/core
+
+* **Queue Disable**: Disable power scheme for certain queue/port/core
+
 References
 ----------
 
@@ -200,3 +239,6 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../sample_app_ug/rxtx_callbacks`
+    chapter in the :doc:`../sample_app_ug/index` section.
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
@ 2020-10-09 16:09         ` Jerin Jacob
  2020-10-09 16:24           ` Burakov, Anatoly
  2020-10-12 19:47         ` David Christensen
  1 sibling, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-09 16:09 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dpdk-dev, Liang Ma, Jan Viktorin, Ruifeng Wang,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Thomas Monjalon, McDaniel, Timothy, Gage Eads,
	chris.macnamara

On Fri, Oct 9, 2020 at 9:32 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
>
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
>
> For more details, please refer to Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual, Volume 2.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
>     v5:
>     - Removed return values
>     - Simplified intrinsics and hardcoded C0.2 state
>     - Added other arch stubs
>
>  lib/librte_eal/arm/include/meson.build        |   1 +
>  .../arm/include/rte_power_intrinsics.h        |  62 ++++++++++
>  .../include/generic/rte_power_intrinsics.h    |  61 ++++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/ppc/include/meson.build        |   1 +
>  .../ppc/include/rte_power_intrinsics.h        |  62 ++++++++++
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 106 ++++++++++++++++++
>  8 files changed, 295 insertions(+)
>  create mode 100644 lib/librte_eal/arm/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/ppc/include/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
>
> diff --git a/lib/librte_eal/arm/include/meson.build b/lib/librte_eal/arm/include/meson.build
> index 73b750a18f..c6a9f70d73 100644
> --- a/lib/librte_eal/arm/include/meson.build
> +++ b/lib/librte_eal/arm/include/meson.build
> @@ -20,6 +20,7 @@ arch_headers = files(
>         'rte_pause_32.h',
>         'rte_pause_64.h',
>         'rte_pause.h',
> +       'rte_power_intrinsics.h',
>         'rte_prefetch_32.h',
>         'rte_prefetch_64.h',
>         'rte_prefetch.h',
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..4aad44a0b9
> --- /dev/null
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_ARM_H_
> +#define _RTE_POWER_INTRINSIC_ARM_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on ARM.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 0 on success


remove return as it is a void function

> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_ARM_H_ */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..e36c1f8976
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file define APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, a certain TSC timestamp is reached, or other
> + * reasons cause the CPU to wake up.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   - 0 on success
> + *   - -ENOTSUP if not supported
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index cd09027958..3a12e87e19 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -60,6 +60,7 @@ generic_headers = files(
>         'generic/rte_memcpy.h',
>         'generic/rte_pause.h',
>         'generic/rte_prefetch.h',
> +       'generic/rte_power_intrinsics.h',
>         'generic/rte_rwlock.h',
>         'generic/rte_spinlock.h',
>         'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/ppc/include/meson.build b/lib/librte_eal/ppc/include/meson.build
> index ab4bd28092..0873b2aecb 100644
> --- a/lib/librte_eal/ppc/include/meson.build
> +++ b/lib/librte_eal/ppc/include/meson.build
> @@ -10,6 +10,7 @@ arch_headers = files(
>         'rte_io.h',
>         'rte_memcpy.h',
>         'rte_pause.h',
> +       'rte_power_intrinsics.h',
>         'rte_prefetch.h',
>         'rte_rwlock.h',
>         'rte_spinlock.h',
> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..70fd7b094f
> --- /dev/null
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_PPC_H_
> +#define _RTE_POWER_INTRINSIC_PPC_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on PPC64.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 0 on success
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(p);
> +       RTE_SET_USED(expected_value);
> +       RTE_SET_USED(value_mask);
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on PPC64.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>         'rte_io.h',
>         'rte_memcpy.h',
>         'rte_prefetch.h',
> +       'rte_power_intrinsics.h',
>         'rte_pause.h',
>         'rte_rtm.h',
>         'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..8d579eaf64
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,106 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, a certain TSC timestamp is reached, or other
> + * reasons cause the CPU to wake up.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
> + * For more information about usage of these instructions, please refer to
> + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 0 on success
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint64_t tsc_timestamp)
> +{
> +       const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +       const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +       /*
> +        * we're using raw byte codes for now as only the newest compiler
> +        * versions support this instruction natively.
> +        */
> +
> +       /* set address for UMONITOR */
> +       asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +                       :
> +                       : "D"(p));
> +
> +       if (value_mask) {
> +               const uint64_t cur_value = *(const volatile uint64_t *)p;
> +               const uint64_t masked = cur_value & value_mask;
> +               /* if the masked value is already matching, abort */
> +               if (masked == expected_value)
> +                       return;
> +       }
> +       /* execute UMWAIT */
> +       asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
> +               : /* ignore rflags */
> +               : "D"(0), /* enter C0.2 */
> +                 "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction  and will enter C0.2 state. For more
> + * information about usage of this instruction, please refer to Intel(R) 64 and
> + * IA-32 Architectures Software Developer's Manual.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +       const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +
> +       /* execute TPAUSE */
> +       asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;"
> +                    : /* ignore rflags */
> +                    : "D"(0), /* enter C0.2 */
> +                      "a"(tsc_l), "d"(tsc_h));
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 15:39             ` Ananyev, Konstantin
@ 2020-10-09 16:10               ` Burakov, Anatoly
  2020-10-09 16:56                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:10 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
> 
>> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>>>
>>>> Add two new power management intrinsics, and provide an implementation
>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>> compiler support for these instructions.
>>>>
>>>> The power management instructions provide an architecture-specific
>>>> function to either wait until a specified TSC timestamp is reached, or
>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>> location is written to. The monitor function also provides an optional
>>>> comparison, to avoid sleeping when the expected write has already
>>>> happened, and no more writes are expected.
>>>
>>> I think what this API is missing - a function to wakeup sleeping core.
>>> If user can/should use some system call to achieve that, then at least
>>> it has to be clearly documented, even better some wrapper provided.
>>
>> I don't think it's possible to do that without severely overcomplicating
>> the intrinsic and its usage, because AFAIK the only way to wake up a
>> sleeping core would be to send some kind of interrupt to the core, or
>> trigger a write to the cache-line in question.
>>
> 
> Yes, I think we either need a syscall that would do an IPI for us
> (on top of my head - membarrier() does that, might be there are some other syscalls too),
> or something hand-made. For hand-made, I wonder would something like that
> be safe and sufficient:
> uint64_t val = atomic_load(addr);
> CAS(addr, val, &val);
> ?
> Anyway, one way or another - I think ability to wakeup core we put to sleep
> have to be an essential part of this feature.
> As I understand linux kernel will limit max amount of sleep time for these instructions:
> https://lwn.net/Articles/790920/
> But relying just on that, seems too vague for me:
> - user can adjust that value
> - wouldn't apply to older kernels and non-linux cases
> Konstantin
> 

This implies knowing the value the core is sleeping on. That's not 
always the case - with this particular PMD power management scheme, we 
get the address from the PMD and it stays inside the callback.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 03/10] ethdev: add simple power management API
  2020-10-08 22:26         ` Ananyev, Konstantin
@ 2020-10-09 16:11           ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:11 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 08-Oct-20 11:26 PM, Ananyev, Konstantin wrote:
>>
>> Add a simple API allow ethdev get wake up address from PMD.
>> Also include internal structure update.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/librte_ethdev/rte_ethdev.c           | 19 ++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev_version.map |  1 +
>>   4 files changed, 72 insertions(+)
>>
>> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
>> index d7668114ca..88253d95f9 100644
>> --- a/lib/librte_ethdev/rte_ethdev.c
>> +++ b/lib/librte_ethdev/rte_ethdev.c
>> @@ -4804,6 +4804,25 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>>          dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
>>   }
>>
>> +int
>> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
>> +      volatile void **wake_addr,
>> +      uint64_t *expected, uint64_t *mask)
>> +{
>> +struct rte_eth_dev *dev;
>> +uint16_t ret;
>> +
>> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
>> +
>> +dev = &rte_eth_devices[port_id];
>> +
>> +ret = (*dev->dev_ops->get_wake_addr)
>> +(dev->data->rx_queues[queue_id],
>> + wake_addr, expected, mask);
> 
> 
> This is an optional dev_ops, so I think you need to check that get_wake_addr()
> is defined for that PMD.
> Plus you need to check that queue_id is valid.
> 

Sorry, added dev_ops check but left out the queue id part :( Will fix in v6.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics
  2020-10-09 16:09         ` Jerin Jacob
@ 2020-10-09 16:24           ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:24 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Liang Ma, Jan Viktorin, Ruifeng Wang,
	David Christensen, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Thomas Monjalon, McDaniel, Timothy, Gage Eads,
	chris.macnamara

On 09-Oct-20 5:09 PM, Jerin Jacob wrote:
> On Fri, Oct 9, 2020 at 9:32 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
>>
>> For more details, please refer to Intel(R) 64 and IA-32 Architectures
>> Software Developer's Manual, Volume 2.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v5:
>>      - Removed return values
>>      - Simplified intrinsics and hardcoded C0.2 state
>>      - Added other arch stubs

<snip>

>> +
>> +/**
>> + * This function is not supported on ARM.
>> + *
>> + * @param p
>> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>> + * @param expected_value
>> + *   Before attempting the monitoring, the `p` address may be read and compared
>> + *   against this value. If `value_mask` is zero, this step will be skipped.
>> + * @param value_mask
>> + *   The 64-bit mask to use to extract current value from `p`.
>> + * @param tsc_timestamp
>> + *   Maximum TSC timestamp to wait for.
>> + *
>> + * @return
>> + *   - 0 on success
> 
> 
> remove return as it is a void function

Oops, will fix in v6. Thanks!

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-02 14:11       ` [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback Liang Ma
@ 2020-10-09 16:38         ` Ananyev, Konstantin
  2020-10-09 16:47           ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 16:38 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, stephen, Burakov, Anatoly

> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API support 1 port to multiple core use case.
> 
> This design leverage RX Callback mechnaism which allow three
> different power management methodology co exist.
> 
> 1. umwait/umonitor:
> 
>    The TSC timestamp is automatically calculated using current
>    link speed and RX descriptor ring size, such that the sleep
>    time is not longer than it would take for a NIC to fill its
>    entire RX descriptor ring.
> 
> 2. Pause instruction
> 
>    Instead of move the core into deeper C state, this lightweight
>    method use Pause instruction to relief the processor from
>    busy polling.
> 
> 3. Frequency Scaling
>    Reuse exist rte power library to scale up/down core frequency
>    depend on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_power/meson.build           |   5 +-
>  lib/librte_power/pmd_mgmt.h            |  49 ++++++
>  lib/librte_power/rte_power_pmd_mgmt.c  | 208 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++++
>  lib/librte_power/rte_power_version.map |   4 +
>  5 files changed, 352 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_power/pmd_mgmt.h
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> 
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 78c031c943..cc3c7a8646 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>  		'power_kvm_vm.c', 'guest_channel.c',
>  		'rte_power_empty_poll.c',
>  		'power_pstate_cpufreq.c',
> +		'rte_power_pmd_mgmt.c',
>  		'power_common.c')
> -headers = files('rte_power.h','rte_power_empty_poll.h')
> -deps += ['timer']
> +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
> +deps += ['timer' ,'ethdev']
> diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
> new file mode 100644
> index 0000000000..756fbe20f7
> --- /dev/null
> +++ b/lib/librte_power/pmd_mgmt.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#ifndef _PMD_MGMT_H
> +#define _PMD_MGMT_H
> +
> +/**
> + * @file
> + * Power Management
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum pmd_mgmt_state {
> +	/** Device power management is disabled. */
> +	PMD_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	PMD_MGMT_ENABLED,
> +};
> +
> +struct pmd_queue_cfg {
> +	enum pmd_mgmt_state pwr_mgmt_state;
> +	/**< Power mgmt Callback mode */
> +	enum rte_power_pmd_mgmt_type cb_mode;
> +	/**< Empty poll number */
> +	uint16_t empty_poll_stats;
> +	/**< Callback instance  */
> +	const struct rte_eth_rxtx_callback *cur_cb;
> +} __rte_cache_aligned;
> +
> +struct pmd_port_cfg {
> +	int  ref_cnt;
> +	struct pmd_queue_cfg *queue_cfg;
> +} __rte_cache_aligned;
> +
> +
> +
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> new file mode 100644
> index 0000000000..35d2af46a4
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> @@ -0,0 +1,208 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_malloc.h>
> +#include <rte_ethdev.h>
> +#include <rte_power_intrinsics.h>
> +
> +#include "rte_power_pmd_mgmt.h"
> +#include "pmd_mgmt.h"
> +
> +
> +#define EMPTYPOLL_MAX  512
> +#define PAUSE_NUM  64
> +
> +static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
> +
> +static uint16_t
> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +
> +	struct pmd_queue_cfg *q_conf;
> +	q_conf = &port_cfg[port_id].queue_cfg[qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {


Here and in other places - wouldn't it be better to empty_poll_max as configurable
parameter, instead of constant value? 

> +			volatile void *target_addr;
> +			uint64_t expected, mask;
> +			uint16_t ret;
> +
> +			/*
> +			 * get address of next descriptor in the RX
> +			 * ring for this queue, as well as expected
> +			 * value and a mask.
> +			 */
> +			ret = rte_eth_get_wake_addr(port_id, qidx,
> +						    &target_addr, &expected,
> +						    &mask);
> +			if (ret == 0)
> +				/* -1ULL is maximum value for TSC */
> +				rte_power_monitor(target_addr,
> +						  expected, mask,
> +						  0, -1ULL);


Why not make timeout a user specified parameter?

> +		}
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct pmd_queue_cfg *q_conf;
> +	int i;
> +	q_conf = &port_cfg[port_id].queue_cfg[qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			for (i = 0; i < PAUSE_NUM; i++)
> +				rte_pause();

Just rte_delay_us(timeout) instead of this loop?

> +		}
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct pmd_queue_cfg *q_conf;
> +	q_conf = &port_cfg[port_id].queue_cfg[qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			/*scale down freq */
> +			rte_power_freq_min(rte_lcore_id());
> +
> +		}
> +	} else {
> +		q_conf->empty_poll_stats = 0;
> +		/* scal up freq */
> +		rte_power_freq_max(rte_lcore_id());
> +	}
> +
> +	return nb_rx;
> +}
> +

Probably worth to mention in comments that these functions enable/disable
are not MT safe.

> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +				uint16_t port_id,
> +				uint16_t queue_id,
> +				enum rte_power_pmd_mgmt_type mode)
> +{
> +	struct rte_eth_dev *dev;
> +	struct pmd_queue_cfg *queue_cfg;
> +	int ret = 0;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (port_cfg[port_id].queue_cfg == NULL) {
> +		port_cfg[port_id].ref_cnt = 0;
> +		/* allocate memory for empty poll stats */
> +		port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
> +					sizeof(struct pmd_queue_cfg)
> +					* RTE_MAX_QUEUES_PER_PORT,
> +					0, dev->data->numa_node);
> +		if (port_cfg[port_id].queue_cfg == NULL)
> +			return -ENOMEM;
> +	}
> +
> +	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> +		ret = -EINVAL;
> +		goto failure_handler;
> +	}
> +
> +	switch (mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:
> +		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> +			ret = -ENOTSUP;
> +			goto failure_handler;
> +		}
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +						rte_power_mgmt_umwait, NULL);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +		/* init scale freq */
> +		if (rte_power_init(lcore_id)) {
> +			ret = -EINVAL;
> +			goto failure_handler;
> +		}
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +					rte_power_mgmt_scalefreq, NULL);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +						rte_power_mgmt_pause, NULL);
> +		break;

	default:
		....

> +	}
> +	queue_cfg->cb_mode = mode;
> +	port_cfg[port_id].ref_cnt++;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +	return ret;
> +
> +failure_handler:
> +	if (port_cfg[port_id].ref_cnt == 0) {
> +		rte_free(port_cfg[port_id].queue_cfg);
> +		port_cfg[port_id].queue_cfg = NULL;
> +	}
> +	return ret;
> +}
> +
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +				uint16_t port_id,
> +				uint16_t queue_id)
> +{
> +	struct pmd_queue_cfg *queue_cfg;
> +
> +	if (port_cfg[port_id].ref_cnt <= 0)
> +		return -EINVAL;
> +
> +	queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
> +		return -EINVAL;
> +
> +	switch (queue_cfg->cb_mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:

Think we need wakeup(lcore_id) here.

> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +					   queue_cfg->cur_cb);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +		rte_power_freq_max(lcore_id);
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +					   queue_cfg->cur_cb);
> +		rte_power_exit(lcore_id);
> +		break;
> +	}
> +	/* it's not recommend to free callback instance here.
> +	 * it cause memory leak which is a known issue.
> +	 */
> +	queue_cfg->cur_cb = NULL;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +	port_cfg[port_id].ref_cnt--;
> +
> +	if (port_cfg[port_id].ref_cnt == 0) {
> +		rte_free(port_cfg[port_id].queue_cfg);

It is not safe to do so, unless device is already stopped.
Otherwise you need some sync mechanism here (hand-made as bpf lib, or rcu online/offline, or...)

> +		port_cfg[port_id].queue_cfg = NULL;
> +	}
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-09 16:38         ` Ananyev, Konstantin
@ 2020-10-09 16:47           ` Burakov, Anatoly
  2020-10-09 16:51             ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:47 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 09-Oct-20 5:38 PM, Ananyev, Konstantin wrote:
>> Add a simple on/off switch that will enable saving power when no
>> packets are arriving. It is based on counting the number of empty
>> polls and, when the number reaches a certain threshold, entering an
>> architecture-defined optimized power state that will either wait
>> until a TSC timestamp expires, or when packets arrive.
>>
>> This API support 1 port to multiple core use case.
>>
>> This design leverage RX Callback mechnaism which allow three
>> different power management methodology co exist.
>>
>> 1. umwait/umonitor:
>>
>>     The TSC timestamp is automatically calculated using current
>>     link speed and RX descriptor ring size, such that the sleep
>>     time is not longer than it would take for a NIC to fill its
>>     entire RX descriptor ring.
>>
>> 2. Pause instruction
>>
>>     Instead of move the core into deeper C state, this lightweight
>>     method use Pause instruction to relief the processor from
>>     busy polling.
>>
>> 3. Frequency Scaling
>>     Reuse exist rte power library to scale up/down core frequency
>>     depend on traffic volume.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/librte_power/meson.build           |   5 +-
>>   lib/librte_power/pmd_mgmt.h            |  49 ++++++
>>   lib/librte_power/rte_power_pmd_mgmt.c  | 208 +++++++++++++++++++++++++
>>   lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++++
>>   lib/librte_power/rte_power_version.map |   4 +
>>   5 files changed, 352 insertions(+), 2 deletions(-)
>>   create mode 100644 lib/librte_power/pmd_mgmt.h
>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
>>
>> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
>> index 78c031c943..cc3c7a8646 100644
>> --- a/lib/librte_power/meson.build
>> +++ b/lib/librte_power/meson.build
>> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>>   'power_kvm_vm.c', 'guest_channel.c',
>>   'rte_power_empty_poll.c',
>>   'power_pstate_cpufreq.c',
>> +'rte_power_pmd_mgmt.c',
>>   'power_common.c')
>> -headers = files('rte_power.h','rte_power_empty_poll.h')
>> -deps += ['timer']
>> +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
>> +deps += ['timer' ,'ethdev']
>> diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
>> new file mode 100644
>> index 0000000000..756fbe20f7
>> --- /dev/null
>> +++ b/lib/librte_power/pmd_mgmt.h
>> @@ -0,0 +1,49 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2010-2020 Intel Corporation
>> + */
>> +
>> +#ifndef _PMD_MGMT_H
>> +#define _PMD_MGMT_H
>> +
>> +/**
>> + * @file
>> + * Power Management
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * Possible power management states of an ethdev port.
>> + */
>> +enum pmd_mgmt_state {
>> +/** Device power management is disabled. */
>> +PMD_MGMT_DISABLED = 0,
>> +/** Device power management is enabled. */
>> +PMD_MGMT_ENABLED,
>> +};
>> +
>> +struct pmd_queue_cfg {
>> +enum pmd_mgmt_state pwr_mgmt_state;
>> +/**< Power mgmt Callback mode */
>> +enum rte_power_pmd_mgmt_type cb_mode;
>> +/**< Empty poll number */
>> +uint16_t empty_poll_stats;
>> +/**< Callback instance  */
>> +const struct rte_eth_rxtx_callback *cur_cb;
>> +} __rte_cache_aligned;
>> +
>> +struct pmd_port_cfg {
>> +int  ref_cnt;
>> +struct pmd_queue_cfg *queue_cfg;
>> +} __rte_cache_aligned;
>> +
>> +
>> +
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif
>> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
>> new file mode 100644
>> index 0000000000..35d2af46a4
>> --- /dev/null
>> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
>> @@ -0,0 +1,208 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2010-2020 Intel Corporation
>> + */
>> +
>> +#include <rte_lcore.h>
>> +#include <rte_cycles.h>
>> +#include <rte_malloc.h>
>> +#include <rte_ethdev.h>
>> +#include <rte_power_intrinsics.h>
>> +
>> +#include "rte_power_pmd_mgmt.h"
>> +#include "pmd_mgmt.h"
>> +
>> +
>> +#define EMPTYPOLL_MAX  512
>> +#define PAUSE_NUM  64
>> +
>> +static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
>> +
>> +static uint16_t
>> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> +{
>> +
>> +struct pmd_queue_cfg *q_conf;
>> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
>> +
>> +if (unlikely(nb_rx == 0)) {
>> +q_conf->empty_poll_stats++;
>> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> 
> 
> Here and in other places - wouldn't it be better to empty_poll_max as configurable
> parameter, instead of constant value?

It would be more flexible, but i don't think it's "better" in the sense 
that providing additional options will only make using this (already 
under-utilized!) API harder than it needs to be.

> 
>> +volatile void *target_addr;
>> +uint64_t expected, mask;
>> +uint16_t ret;
>> +
>> +/*
>> + * get address of next descriptor in the RX
>> + * ring for this queue, as well as expected
>> + * value and a mask.
>> + */
>> +ret = rte_eth_get_wake_addr(port_id, qidx,
>> +    &target_addr, &expected,
>> +    &mask);
>> +if (ret == 0)
>> +/* -1ULL is maximum value for TSC */
>> +rte_power_monitor(target_addr,
>> +  expected, mask,
>> +  0, -1ULL);
> 
> 
> Why not make timeout a user specified parameter?

This is meant to be an "easy to use" API, we were trying to keep the 
amount of configuration to an absolute minimum. If the user wants to use 
custom timeouts, they can do so with using rte_power_monitor API explicitly.

> 
>> +}
>> +} else
>> +q_conf->empty_poll_stats = 0;
>> +
>> +return nb_rx;
>> +}
>> +
>> +static uint16_t
>> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
>> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> +{
>> +struct pmd_queue_cfg *q_conf;
>> +int i;
>> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
>> +
>> +if (unlikely(nb_rx == 0)) {
>> +q_conf->empty_poll_stats++;
>> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> +for (i = 0; i < PAUSE_NUM; i++)
>> +rte_pause();
> 
> Just rte_delay_us(timeout) instead of this loop?

Yes, seems better, thanks.

> 
>> +}
>> +} else
>> +q_conf->empty_poll_stats = 0;
>> +
>> +return nb_rx;
>> +}
>> +
>> +static uint16_t
>> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
>> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> +{
>> +struct pmd_queue_cfg *q_conf;
>> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
>> +
>> +if (unlikely(nb_rx == 0)) {
>> +q_conf->empty_poll_stats++;
>> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> +/*scale down freq */
>> +rte_power_freq_min(rte_lcore_id());
>> +
>> +}
>> +} else {
>> +q_conf->empty_poll_stats = 0;
>> +/* scal up freq */
>> +rte_power_freq_max(rte_lcore_id());
>> +}
>> +
>> +return nb_rx;
>> +}
>> +
> 
> Probably worth to mention in comments that these functions enable/disable
> are not MT safe.

Will do in v6.

> 
>> +int
>> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
>> +uint16_t port_id,
>> +uint16_t queue_id,
>> +enum rte_power_pmd_mgmt_type mode)
>> +{
>> +struct rte_eth_dev *dev;
>> +struct pmd_queue_cfg *queue_cfg;
>> +int ret = 0;
>> +
>> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>> +dev = &rte_eth_devices[port_id];
>> +
>> +if (port_cfg[port_id].queue_cfg == NULL) {
>> +port_cfg[port_id].ref_cnt = 0;
>> +/* allocate memory for empty poll stats */
>> +port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
>> +sizeof(struct pmd_queue_cfg)
>> +* RTE_MAX_QUEUES_PER_PORT,
>> +0, dev->data->numa_node);
>> +if (port_cfg[port_id].queue_cfg == NULL)
>> +return -ENOMEM;
>> +}
>> +
>> +queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
>> +
>> +if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
>> +ret = -EINVAL;
>> +goto failure_handler;
>> +}
>> +
>> +switch (mode) {
>> +case RTE_POWER_MGMT_TYPE_WAIT:
>> +if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>> +ret = -ENOTSUP;
>> +goto failure_handler;
>> +}
>> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +rte_power_mgmt_umwait, NULL);
>> +break;
>> +case RTE_POWER_MGMT_TYPE_SCALE:
>> +/* init scale freq */
>> +if (rte_power_init(lcore_id)) {
>> +ret = -EINVAL;
>> +goto failure_handler;
>> +}
>> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +rte_power_mgmt_scalefreq, NULL);
>> +break;
>> +case RTE_POWER_MGMT_TYPE_PAUSE:
>> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +rte_power_mgmt_pause, NULL);
>> +break;
> 
> default:
> ....

Will add in v6.

> 
>> +}
>> +queue_cfg->cb_mode = mode;
>> +port_cfg[port_id].ref_cnt++;
>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +return ret;
>> +
>> +failure_handler:
>> +if (port_cfg[port_id].ref_cnt == 0) {
>> +rte_free(port_cfg[port_id].queue_cfg);
>> +port_cfg[port_id].queue_cfg = NULL;
>> +}
>> +return ret;
>> +}
>> +
>> +int
>> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>> +uint16_t port_id,
>> +uint16_t queue_id)
>> +{
>> +struct pmd_queue_cfg *queue_cfg;
>> +
>> +if (port_cfg[port_id].ref_cnt <= 0)
>> +return -EINVAL;
>> +
>> +queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
>> +
>> +if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
>> +return -EINVAL;
>> +
>> +switch (queue_cfg->cb_mode) {
>> +case RTE_POWER_MGMT_TYPE_WAIT:
> 
> Think we need wakeup(lcore_id) here.

Not sure what you mean? Could you please elaborate?

> 
>> +case RTE_POWER_MGMT_TYPE_PAUSE:
>> +rte_eth_remove_rx_callback(port_id, queue_id,
>> +   queue_cfg->cur_cb);
>> +break;
>> +case RTE_POWER_MGMT_TYPE_SCALE:
>> +rte_power_freq_max(lcore_id);
>> +rte_eth_remove_rx_callback(port_id, queue_id,
>> +   queue_cfg->cur_cb);
>> +rte_power_exit(lcore_id);
>> +break;
>> +}
>> +/* it's not recommend to free callback instance here.
>> + * it cause memory leak which is a known issue.
>> + */
>> +queue_cfg->cur_cb = NULL;
>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> +port_cfg[port_id].ref_cnt--;
>> +
>> +if (port_cfg[port_id].ref_cnt == 0) {
>> +rte_free(port_cfg[port_id].queue_cfg);
> 
> It is not safe to do so, unless device is already stopped.
> Otherwise you need some sync mechanism here (hand-made as bpf lib, or rcu online/offline, or...)

Not sure what you mean. We're not freeing the callback structure, we're 
freeing the local data structure holding the per-port status.

> 
>> +port_cfg[port_id].queue_cfg = NULL;
>> +}
>> +return 0;
>> +}


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-09 16:47           ` Burakov, Anatoly
@ 2020-10-09 16:51             ` Ananyev, Konstantin
  2020-10-09 16:56               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 16:51 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen

> On 09-Oct-20 5:38 PM, Ananyev, Konstantin wrote:
> >> Add a simple on/off switch that will enable saving power when no
> >> packets are arriving. It is based on counting the number of empty
> >> polls and, when the number reaches a certain threshold, entering an
> >> architecture-defined optimized power state that will either wait
> >> until a TSC timestamp expires, or when packets arrive.
> >>
> >> This API support 1 port to multiple core use case.
> >>
> >> This design leverage RX Callback mechnaism which allow three
> >> different power management methodology co exist.
> >>
> >> 1. umwait/umonitor:
> >>
> >>     The TSC timestamp is automatically calculated using current
> >>     link speed and RX descriptor ring size, such that the sleep
> >>     time is not longer than it would take for a NIC to fill its
> >>     entire RX descriptor ring.
> >>
> >> 2. Pause instruction
> >>
> >>     Instead of move the core into deeper C state, this lightweight
> >>     method use Pause instruction to relief the processor from
> >>     busy polling.
> >>
> >> 3. Frequency Scaling
> >>     Reuse exist rte power library to scale up/down core frequency
> >>     depend on traffic volume.
> >>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>   lib/librte_power/meson.build           |   5 +-
> >>   lib/librte_power/pmd_mgmt.h            |  49 ++++++
> >>   lib/librte_power/rte_power_pmd_mgmt.c  | 208 +++++++++++++++++++++++++
> >>   lib/librte_power/rte_power_pmd_mgmt.h  |  88 +++++++++++
> >>   lib/librte_power/rte_power_version.map |   4 +
> >>   5 files changed, 352 insertions(+), 2 deletions(-)
> >>   create mode 100644 lib/librte_power/pmd_mgmt.h
> >>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
> >>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> >>
> >> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> >> index 78c031c943..cc3c7a8646 100644
> >> --- a/lib/librte_power/meson.build
> >> +++ b/lib/librte_power/meson.build
> >> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> >>   'power_kvm_vm.c', 'guest_channel.c',
> >>   'rte_power_empty_poll.c',
> >>   'power_pstate_cpufreq.c',
> >> +'rte_power_pmd_mgmt.c',
> >>   'power_common.c')
> >> -headers = files('rte_power.h','rte_power_empty_poll.h')
> >> -deps += ['timer']
> >> +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
> >> +deps += ['timer' ,'ethdev']
> >> diff --git a/lib/librte_power/pmd_mgmt.h b/lib/librte_power/pmd_mgmt.h
> >> new file mode 100644
> >> index 0000000000..756fbe20f7
> >> --- /dev/null
> >> +++ b/lib/librte_power/pmd_mgmt.h
> >> @@ -0,0 +1,49 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2010-2020 Intel Corporation
> >> + */
> >> +
> >> +#ifndef _PMD_MGMT_H
> >> +#define _PMD_MGMT_H
> >> +
> >> +/**
> >> + * @file
> >> + * Power Management
> >> + */
> >> +
> >> +#ifdef __cplusplus
> >> +extern "C" {
> >> +#endif
> >> +
> >> +/**
> >> + * Possible power management states of an ethdev port.
> >> + */
> >> +enum pmd_mgmt_state {
> >> +/** Device power management is disabled. */
> >> +PMD_MGMT_DISABLED = 0,
> >> +/** Device power management is enabled. */
> >> +PMD_MGMT_ENABLED,
> >> +};
> >> +
> >> +struct pmd_queue_cfg {
> >> +enum pmd_mgmt_state pwr_mgmt_state;
> >> +/**< Power mgmt Callback mode */
> >> +enum rte_power_pmd_mgmt_type cb_mode;
> >> +/**< Empty poll number */
> >> +uint16_t empty_poll_stats;
> >> +/**< Callback instance  */
> >> +const struct rte_eth_rxtx_callback *cur_cb;
> >> +} __rte_cache_aligned;
> >> +
> >> +struct pmd_port_cfg {
> >> +int  ref_cnt;
> >> +struct pmd_queue_cfg *queue_cfg;
> >> +} __rte_cache_aligned;
> >> +
> >> +
> >> +
> >> +
> >> +#ifdef __cplusplus
> >> +}
> >> +#endif
> >> +
> >> +#endif
> >> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> >> new file mode 100644
> >> index 0000000000..35d2af46a4
> >> --- /dev/null
> >> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> >> @@ -0,0 +1,208 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2010-2020 Intel Corporation
> >> + */
> >> +
> >> +#include <rte_lcore.h>
> >> +#include <rte_cycles.h>
> >> +#include <rte_malloc.h>
> >> +#include <rte_ethdev.h>
> >> +#include <rte_power_intrinsics.h>
> >> +
> >> +#include "rte_power_pmd_mgmt.h"
> >> +#include "pmd_mgmt.h"
> >> +
> >> +
> >> +#define EMPTYPOLL_MAX  512
> >> +#define PAUSE_NUM  64
> >> +
> >> +static struct pmd_port_cfg port_cfg[RTE_MAX_ETHPORTS];
> >> +
> >> +static uint16_t
> >> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
> >> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> >> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> >> +{
> >> +
> >> +struct pmd_queue_cfg *q_conf;
> >> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
> >> +
> >> +if (unlikely(nb_rx == 0)) {
> >> +q_conf->empty_poll_stats++;
> >> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> >
> >
> > Here and in other places - wouldn't it be better to empty_poll_max as configurable
> > parameter, instead of constant value?
> 
> It would be more flexible, but i don't think it's "better" in the sense
> that providing additional options will only make using this (already
> under-utilized!) API harder than it needs to be.
> 
> >
> >> +volatile void *target_addr;
> >> +uint64_t expected, mask;
> >> +uint16_t ret;
> >> +
> >> +/*
> >> + * get address of next descriptor in the RX
> >> + * ring for this queue, as well as expected
> >> + * value and a mask.
> >> + */
> >> +ret = rte_eth_get_wake_addr(port_id, qidx,
> >> +    &target_addr, &expected,
> >> +    &mask);
> >> +if (ret == 0)
> >> +/* -1ULL is maximum value for TSC */
> >> +rte_power_monitor(target_addr,
> >> +  expected, mask,
> >> +  0, -1ULL);
> >
> >
> > Why not make timeout a user specified parameter?
> 
> This is meant to be an "easy to use" API, we were trying to keep the
> amount of configuration to an absolute minimum. If the user wants to use
> custom timeouts, they can do so with using rte_power_monitor API explicitly.
> 
> >
> >> +}
> >> +} else
> >> +q_conf->empty_poll_stats = 0;
> >> +
> >> +return nb_rx;
> >> +}
> >> +
> >> +static uint16_t
> >> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> >> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> >> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> >> +{
> >> +struct pmd_queue_cfg *q_conf;
> >> +int i;
> >> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
> >> +
> >> +if (unlikely(nb_rx == 0)) {
> >> +q_conf->empty_poll_stats++;
> >> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> >> +for (i = 0; i < PAUSE_NUM; i++)
> >> +rte_pause();
> >
> > Just rte_delay_us(timeout) instead of this loop?
> 
> Yes, seems better, thanks.
> 
> >
> >> +}
> >> +} else
> >> +q_conf->empty_poll_stats = 0;
> >> +
> >> +return nb_rx;
> >> +}
> >> +
> >> +static uint16_t
> >> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> >> +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> >> +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> >> +{
> >> +struct pmd_queue_cfg *q_conf;
> >> +q_conf = &port_cfg[port_id].queue_cfg[qidx];
> >> +
> >> +if (unlikely(nb_rx == 0)) {
> >> +q_conf->empty_poll_stats++;
> >> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> >> +/*scale down freq */
> >> +rte_power_freq_min(rte_lcore_id());
> >> +
> >> +}
> >> +} else {
> >> +q_conf->empty_poll_stats = 0;
> >> +/* scal up freq */
> >> +rte_power_freq_max(rte_lcore_id());
> >> +}
> >> +
> >> +return nb_rx;
> >> +}
> >> +
> >
> > Probably worth to mention in comments that these functions enable/disable
> > are not MT safe.
> 
> Will do in v6.
> 
> >
> >> +int
> >> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> >> +uint16_t port_id,
> >> +uint16_t queue_id,
> >> +enum rte_power_pmd_mgmt_type mode)
> >> +{
> >> +struct rte_eth_dev *dev;
> >> +struct pmd_queue_cfg *queue_cfg;
> >> +int ret = 0;
> >> +
> >> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> >> +dev = &rte_eth_devices[port_id];
> >> +
> >> +if (port_cfg[port_id].queue_cfg == NULL) {
> >> +port_cfg[port_id].ref_cnt = 0;
> >> +/* allocate memory for empty poll stats */
> >> +port_cfg[port_id].queue_cfg  = rte_malloc_socket(NULL,
> >> +sizeof(struct pmd_queue_cfg)
> >> +* RTE_MAX_QUEUES_PER_PORT,
> >> +0, dev->data->numa_node);
> >> +if (port_cfg[port_id].queue_cfg == NULL)
> >> +return -ENOMEM;
> >> +}
> >> +
> >> +queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
> >> +
> >> +if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> >> +ret = -EINVAL;
> >> +goto failure_handler;
> >> +}
> >> +
> >> +switch (mode) {
> >> +case RTE_POWER_MGMT_TYPE_WAIT:
> >> +if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> >> +ret = -ENOTSUP;
> >> +goto failure_handler;
> >> +}
> >> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +rte_power_mgmt_umwait, NULL);
> >> +break;
> >> +case RTE_POWER_MGMT_TYPE_SCALE:
> >> +/* init scale freq */
> >> +if (rte_power_init(lcore_id)) {
> >> +ret = -EINVAL;
> >> +goto failure_handler;
> >> +}
> >> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +rte_power_mgmt_scalefreq, NULL);
> >> +break;
> >> +case RTE_POWER_MGMT_TYPE_PAUSE:
> >> +queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +rte_power_mgmt_pause, NULL);
> >> +break;
> >
> > default:
> > ....
> 
> Will add in v6.
> 
> >
> >> +}
> >> +queue_cfg->cb_mode = mode;
> >> +port_cfg[port_id].ref_cnt++;
> >> +queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> >> +return ret;
> >> +
> >> +failure_handler:
> >> +if (port_cfg[port_id].ref_cnt == 0) {
> >> +rte_free(port_cfg[port_id].queue_cfg);
> >> +port_cfg[port_id].queue_cfg = NULL;
> >> +}
> >> +return ret;
> >> +}
> >> +
> >> +int
> >> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> >> +uint16_t port_id,
> >> +uint16_t queue_id)
> >> +{
> >> +struct pmd_queue_cfg *queue_cfg;
> >> +
> >> +if (port_cfg[port_id].ref_cnt <= 0)
> >> +return -EINVAL;
> >> +
> >> +queue_cfg = &port_cfg[port_id].queue_cfg[queue_id];
> >> +
> >> +if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED)
> >> +return -EINVAL;
> >> +
> >> +switch (queue_cfg->cb_mode) {
> >> +case RTE_POWER_MGMT_TYPE_WAIT:
> >
> > Think we need wakeup(lcore_id) here.
> 
> Not sure what you mean? Could you please elaborate?
> 
> >
> >> +case RTE_POWER_MGMT_TYPE_PAUSE:
> >> +rte_eth_remove_rx_callback(port_id, queue_id,
> >> +   queue_cfg->cur_cb);
> >> +break;
> >> +case RTE_POWER_MGMT_TYPE_SCALE:
> >> +rte_power_freq_max(lcore_id);
> >> +rte_eth_remove_rx_callback(port_id, queue_id,
> >> +   queue_cfg->cur_cb);
> >> +rte_power_exit(lcore_id);
> >> +break;
> >> +}
> >> +/* it's not recommend to free callback instance here.
> >> + * it cause memory leak which is a known issue.
> >> + */
> >> +queue_cfg->cur_cb = NULL;
> >> +queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> >> +port_cfg[port_id].ref_cnt--;
> >> +
> >> +if (port_cfg[port_id].ref_cnt == 0) {
> >> +rte_free(port_cfg[port_id].queue_cfg);
> >
> > It is not safe to do so, unless device is already stopped.
> > Otherwise you need some sync mechanism here (hand-made as bpf lib, or rcu online/offline, or...)
> 
> Not sure what you mean. We're not freeing the callback structure, we're
> freeing the local data structure holding the per-port status.

What is the difference?
You still trying to free memory that might be used by your DP thread
that still executes the callback.

> 
> >
> >> +port_cfg[port_id].queue_cfg = NULL;
> >> +}
> >> +return 0;
> >> +}
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 04/10] power: add simple power management API and callback
  2020-10-09 16:51             ` Ananyev, Konstantin
@ 2020-10-09 16:56               ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:56 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 09-Oct-20 5:51 PM, Ananyev, Konstantin wrote:

>>>
>>>> +case RTE_POWER_MGMT_TYPE_PAUSE:
>>>> +rte_eth_remove_rx_callback(port_id, queue_id,
>>>> +   queue_cfg->cur_cb);
>>>> +break;
>>>> +case RTE_POWER_MGMT_TYPE_SCALE:
>>>> +rte_power_freq_max(lcore_id);
>>>> +rte_eth_remove_rx_callback(port_id, queue_id,
>>>> +   queue_cfg->cur_cb);
>>>> +rte_power_exit(lcore_id);
>>>> +break;
>>>> +}
>>>> +/* it's not recommend to free callback instance here.
>>>> + * it cause memory leak which is a known issue.
>>>> + */
>>>> +queue_cfg->cur_cb = NULL;
>>>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>>>> +port_cfg[port_id].ref_cnt--;
>>>> +
>>>> +if (port_cfg[port_id].ref_cnt == 0) {
>>>> +rte_free(port_cfg[port_id].queue_cfg);
>>>
>>> It is not safe to do so, unless device is already stopped.
>>> Otherwise you need some sync mechanism here (hand-made as bpf lib, or rcu online/offline, or...)
>>
>> Not sure what you mean. We're not freeing the callback structure, we're
>> freeing the local data structure holding the per-port status.
> 
> What is the difference?
> You still trying to free memory that might be used by your DP thread
> that still executes the callback.

Welp, you're right :/ I'll see what i can do to fix it.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 16:10               ` Burakov, Anatoly
@ 2020-10-09 16:56                 ` Ananyev, Konstantin
  2020-10-09 16:59                   ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-09 16:56 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen


> 
> On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
> >
> >> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
> >>>>
> >>>> Add two new power management intrinsics, and provide an implementation
> >>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>> compiler support for these instructions.
> >>>>
> >>>> The power management instructions provide an architecture-specific
> >>>> function to either wait until a specified TSC timestamp is reached, or
> >>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>> location is written to. The monitor function also provides an optional
> >>>> comparison, to avoid sleeping when the expected write has already
> >>>> happened, and no more writes are expected.
> >>>
> >>> I think what this API is missing - a function to wakeup sleeping core.
> >>> If user can/should use some system call to achieve that, then at least
> >>> it has to be clearly documented, even better some wrapper provided.
> >>
> >> I don't think it's possible to do that without severely overcomplicating
> >> the intrinsic and its usage, because AFAIK the only way to wake up a
> >> sleeping core would be to send some kind of interrupt to the core, or
> >> trigger a write to the cache-line in question.
> >>
> >
> > Yes, I think we either need a syscall that would do an IPI for us
> > (on top of my head - membarrier() does that, might be there are some other syscalls too),
> > or something hand-made. For hand-made, I wonder would something like that
> > be safe and sufficient:
> > uint64_t val = atomic_load(addr);
> > CAS(addr, val, &val);
> > ?
> > Anyway, one way or another - I think ability to wakeup core we put to sleep
> > have to be an essential part of this feature.
> > As I understand linux kernel will limit max amount of sleep time for these instructions:
> > https://lwn.net/Articles/790920/
> > But relying just on that, seems too vague for me:
> > - user can adjust that value
> > - wouldn't apply to older kernels and non-linux cases
> > Konstantin
> >
> 
> This implies knowing the value the core is sleeping on.

You don't the value to wait for, you just need an address.
And you can make wakeup function to accept address as a parameter,
same as monitor() does. 

> That's not
> always the case - with this particular PMD power management scheme, we
> get the address from the PMD and it stays inside the callback.

That's fine - you can store address inside you callback metadata 
and do wakeup as part of _disable_ function.

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 16:56                 ` Ananyev, Konstantin
@ 2020-10-09 16:59                   ` Burakov, Anatoly
  2020-10-10 13:19                     ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 16:59 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 09-Oct-20 5:56 PM, Ananyev, Konstantin wrote:
> 
>>
>> On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
>>>
>>>> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>>>>>
>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>> compiler support for these instructions.
>>>>>>
>>>>>> The power management instructions provide an architecture-specific
>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>> location is written to. The monitor function also provides an optional
>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>> happened, and no more writes are expected.
>>>>>
>>>>> I think what this API is missing - a function to wakeup sleeping core.
>>>>> If user can/should use some system call to achieve that, then at least
>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>
>>>> I don't think it's possible to do that without severely overcomplicating
>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>> sleeping core would be to send some kind of interrupt to the core, or
>>>> trigger a write to the cache-line in question.
>>>>
>>>
>>> Yes, I think we either need a syscall that would do an IPI for us
>>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
>>> or something hand-made. For hand-made, I wonder would something like that
>>> be safe and sufficient:
>>> uint64_t val = atomic_load(addr);
>>> CAS(addr, val, &val);
>>> ?
>>> Anyway, one way or another - I think ability to wakeup core we put to sleep
>>> have to be an essential part of this feature.
>>> As I understand linux kernel will limit max amount of sleep time for these instructions:
>>> https://lwn.net/Articles/790920/
>>> But relying just on that, seems too vague for me:
>>> - user can adjust that value
>>> - wouldn't apply to older kernels and non-linux cases
>>> Konstantin
>>>
>>
>> This implies knowing the value the core is sleeping on.
> 
> You don't the value to wait for, you just need an address.
> And you can make wakeup function to accept address as a parameter,
> same as monitor() does.

Sorry, i meant the address. We don't know the address we're sleeping on.

> 
>> That's not
>> always the case - with this particular PMD power management scheme, we
>> get the address from the PMD and it stays inside the callback.
> 
> That's fine - you can store address inside you callback metadata
> and do wakeup as part of _disable_ function.
> 

The address may be different, and by the time we access the address it 
may become stale, so i don't see how that would help unless you're 
suggesting to have some kind of synchronization mechanism there.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 01/10] eal: add new x86 cpuid support for WAITPKG
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 " Anatoly Burakov
@ 2020-10-09 17:06         ` Burakov, Anatoly
  2020-10-14 13:30         ` [dpdk-dev] [PATCH v6 " Anatoly Burakov
                           ` (9 subsequent siblings)
  10 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-09 17:06 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, Bruce Richardson, Konstantin Ananyev, david.hunt,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara

On 09-Oct-20 5:02 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add new x86 cpuid support for WAITPKG.
> This flag indicate processor support umwait/umonitor/tpause
> instruction.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

The work on this patchset is clearly going to take a few more days, but 
this particular patch is high priority as there are other dependencies 
on this patch for DLB drivers. Would it be possible to accept this 
particular patch now...

> ---
>   lib/librte_eal/x86/include/rte_cpuflags.h | 2 ++
>   lib/librte_eal/x86/rte_cpuflags.c         | 2 ++
>   2 files changed, 4 insertions(+)
> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..5041a830a7 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
>   	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
>   	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
>   
> +	/**< UMWAIT/TPAUSE Instructions */
> +	RTE_CPUFLAG_WAITPKG,                /**< UMINITOR/UMWAIT/TPAUSE */

...with typo fix (UMONITOR rather than UMINITOR)?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-09 16:59                   ` Burakov, Anatoly
@ 2020-10-10 13:19                     ` Ananyev, Konstantin
  2020-10-12 10:35                       ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-10 13:19 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen



> >>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>> compiler support for these instructions.
> >>>>>>
> >>>>>> The power management instructions provide an architecture-specific
> >>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>> location is written to. The monitor function also provides an optional
> >>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>> happened, and no more writes are expected.
> >>>>>
> >>>>> I think what this API is missing - a function to wakeup sleeping core.
> >>>>> If user can/should use some system call to achieve that, then at least
> >>>>> it has to be clearly documented, even better some wrapper provided.
> >>>>
> >>>> I don't think it's possible to do that without severely overcomplicating
> >>>> the intrinsic and its usage, because AFAIK the only way to wake up a
> >>>> sleeping core would be to send some kind of interrupt to the core, or
> >>>> trigger a write to the cache-line in question.
> >>>>
> >>>
> >>> Yes, I think we either need a syscall that would do an IPI for us
> >>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
> >>> or something hand-made. For hand-made, I wonder would something like that
> >>> be safe and sufficient:
> >>> uint64_t val = atomic_load(addr);
> >>> CAS(addr, val, &val);
> >>> ?
> >>> Anyway, one way or another - I think ability to wakeup core we put to sleep
> >>> have to be an essential part of this feature.
> >>> As I understand linux kernel will limit max amount of sleep time for these instructions:
> >>> https://lwn.net/Articles/790920/
> >>> But relying just on that, seems too vague for me:
> >>> - user can adjust that value
> >>> - wouldn't apply to older kernels and non-linux cases
> >>> Konstantin
> >>>
> >>
> >> This implies knowing the value the core is sleeping on.
> >
> > You don't the value to wait for, you just need an address.
> > And you can make wakeup function to accept address as a parameter,
> > same as monitor() does.
> 
> Sorry, i meant the address. We don't know the address we're sleeping on.
> 
> >
> >> That's not
> >> always the case - with this particular PMD power management scheme, we
> >> get the address from the PMD and it stays inside the callback.
> >
> > That's fine - you can store address inside you callback metadata
> > and do wakeup as part of _disable_ function.
> >
> 
> The address may be different, and by the time we access the address it
> may become stale, so i don't see how that would help unless you're
> suggesting to have some kind of synchronization mechanism there.

Yes, we'll need something to sync here for sure.
Sorry, I should say it straightway, to avoid further misunderstanding.
Let say, associate a spin_lock with monitor(), by analogy with pthread_cond_wait().  
Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
@ 2020-10-11 10:07         ` Jerin Jacob
  2020-10-12  9:26           ` Burakov, Anatoly
  2020-10-12 19:52         ` David Christensen
  1 sibling, 1 reply; 421+ messages in thread
From: Jerin Jacob @ 2020-10-11 10:07 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dpdk-dev, Jan Viktorin, Ruifeng Wang, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Liang Ma, Thomas Monjalon, McDaniel, Timothy,
	Gage Eads, chris.macnamara

On Fri, Oct 9, 2020 at 9:32 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> Currently, it is not possible to check support for intrinsics that
> are platform-specific, cannot be abstracted in a generic way, or do not
> have support on all architectures. The CPUID flags can be used to some
> extent, but they are only defined for their platform, while intrinsics
> will be available to all code as they are in generic headers.
>
> This patch introduces infrastructure to check support for certain
> platform-specific intrinsics, and adds support for checking support for
> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../arm/include/rte_power_intrinsics.h        |  8 ++++++
>  lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
>  lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
>  .../include/generic/rte_power_intrinsics.h    |  8 ++++++
>  .../ppc/include/rte_power_intrinsics.h        |  8 ++++++
>  lib/librte_eal/ppc/rte_cpuflags.c             |  6 +++++
>  lib/librte_eal/rte_eal_version.map            |  1 +
>  .../x86/include/rte_power_intrinsics.h        |  8 ++++++
>  lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
>  9 files changed, 83 insertions(+)
>
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> index 4aad44a0b9..055ec5877a 100644
> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -17,6 +17,10 @@ extern "C" {
>  /**
>   * This function is not supported on ARM.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.

See below

> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
>  /**
>   * This function is not supported on ARM.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
See below

This patch looks to me.

Since rte_power_monitor() API is public API, I think, only in the
generic header file, you need to have
these warnings and API documentation rather than repeating everywhere.



>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for.
>   *
> diff --git a/lib/librte_eal/arm/rte_cpuflags.c b/lib/librte_eal/arm/rte_cpuflags.c
> index caf3dc83a5..7eef11fa02 100644
> --- a/lib/librte_eal/arm/rte_cpuflags.c
> +++ b/lib/librte_eal/arm/rte_cpuflags.c
> @@ -138,3 +138,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/include/generic/rte_cpuflags.h b/lib/librte_eal/include/generic/rte_cpuflags.h
> index 872f0ebe3e..28a5aecde8 100644
> --- a/lib/librte_eal/include/generic/rte_cpuflags.h
> +++ b/lib/librte_eal/include/generic/rte_cpuflags.h
> @@ -13,6 +13,32 @@
>  #include "rte_common.h"
>  #include <errno.h>
>
> +#include <rte_compat.h>
> +
> +/**
> + * Structure used to describe platform-specific intrinsics that may or may not
> + * be supported at runtime.
> + */
> +struct rte_cpu_intrinsics {
> +       uint32_t power_monitor : 1;
> +       /**< indicates support for rte_power_monitor function */
> +       uint32_t power_pause : 1;
> +       /**< indicates support for rte_power_pause function */
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Check CPU support for various intrinsics at runtime.
> + *
> + * @param intrinsics
> + *     Pointer to a structure to be filled.
> + */
> +__rte_experimental
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics);
> +
>  /**
>   * Enumeration of all CPU features supported
>   */
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> index e36c1f8976..218eda7e86 100644
> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -26,6 +26,10 @@
>   * checked against the expected value, and if they match, the entering of
>   * optimized power state may be aborted.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.

See above

> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -49,6 +53,10 @@ static inline void rte_power_monitor(const volatile void *p,
>   * Enter an architecture-defined optimized power state until a certain TSC
>   * timestamp is reached.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for. Note that the wait behavior is
>   *   architecture-dependent.
> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> index 70fd7b094f..d63ad86849 100644
> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -17,6 +17,10 @@ extern "C" {
>  /**
>   * This function is not supported on PPC64.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
>  /**
>   * This function is not supported on PPC64.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for.
>   *
> diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
> index 3bb7563ce9..eee8234384 100644
> --- a/lib/librte_eal/ppc/rte_cpuflags.c
> +++ b/lib/librte_eal/ppc/rte_cpuflags.c
> @@ -108,3 +108,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index a93dea9fe6..ed944f2bd4 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -400,6 +400,7 @@ EXPERIMENTAL {
>         # added in 20.11
>         __rte_eal_trace_generic_size_t;
>         rte_service_lcore_may_be_active;
> +       rte_cpu_get_intrinsics_support;
>  };
>
>  INTERNAL {
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> index 8d579eaf64..3afc165a1f 100644
> --- a/lib/librte_eal/x86/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -29,6 +29,10 @@ extern "C" {
>   * For more information about usage of these instructions, please refer to
>   * Intel(R) 64 and IA-32 Architectures Software Developer's Manual.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param p
>   *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>   * @param expected_value
> @@ -80,6 +84,10 @@ static inline void rte_power_monitor(const volatile void *p,
>   * information about usage of this instruction, please refer to Intel(R) 64 and
>   * IA-32 Architectures Software Developer's Manual.
>   *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>   * @param tsc_timestamp
>   *   Maximum TSC timestamp to wait for.
>   *
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 0325c4b93b..a96312ff7f 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -7,6 +7,7 @@
>  #include <stdio.h>
>  #include <errno.h>
>  #include <stdint.h>
> +#include <string.h>
>
>  #include "rte_cpuid.h"
>
> @@ -179,3 +180,14 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>                 return NULL;
>         return rte_cpu_feature_table[feature].name;
>  }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +       memset(intrinsics, 0, sizeof(*intrinsics));
> +
> +       if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> +               intrinsics->power_monitor = 1;
> +               intrinsics->power_pause = 1;
> +       }
> +}
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
@ 2020-10-12  7:46         ` Wang, Haiyue
  2020-10-12  9:28           ` Burakov, Anatoly
  2020-10-12  9:44           ` Burakov, Anatoly
  2020-10-12  8:09         ` Wang, Haiyue
  1 sibling, 2 replies; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-12  7:46 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

Hi Liang,

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Saturday, October 10, 2020 00:02
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> Macnamara, Chris <chris.macnamara@intel.com>
> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>  drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
> index 0b98e210e7..30b3f416d4 100644
> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
>  	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
>  	.tm_ops_get           = ixgbe_tm_ops_get,
>  	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
> +	.get_wake_addr        = ixgbe_get_wake_addr,
>  };
> 
>  /*
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 977ecf5137..7a9fd2aec6 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1366,6 +1366,28 @@ const uint32_t
>  		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
>  };
> 
> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask)
> +{
> +	volatile union ixgbe_adv_rx_desc *rxdp;
> +	struct ixgbe_rx_queue *rxq = rx_queue;
> +	uint16_t desc;
> +
> +	desc = rxq->rx_tail;
> +	rxdp = &rxq->rx_ring[desc];
> +	/* watch for changes in status bit */
> +	*tail_desc_addr = &rxdp->wb.upper.status_error;
> +
> +	/*
> +	 * we expect the DD bit to be set to 1 if this descriptor was already
> +	 * written to.
> +	 */
> +	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> +	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> +
> +	return 0;
> +}
> +

I'm wondering that whether the '.get_wake_addr' can be specific to
like 'rxq_tailq_addr_get' ? So that one day this wake up mechanism
can be applied to 'txq_tailq_addr_get' ? :-)

Also, "volatile void **tail_desc_addr, uint64_t *expected, uint64_t *mask"
can be merged into 'struct xxx' ? So that you can expand the API easily.

Just my thoughts.

Anyway, LGTM

Acked-by: Haiyue Wang <haiyue.wang@intel.com>

> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API Anatoly Burakov
  2020-10-12  7:46         ` Wang, Haiyue
@ 2020-10-12  8:09         ` Wang, Haiyue
  2020-10-12  9:28           ` Burakov, Anatoly
  1 sibling, 1 reply; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-12  8:09 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

Hi Liang,

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Saturday, October 10, 2020 00:02
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> Macnamara, Chris <chris.macnamara@intel.com>
> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's
> status bit.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>  drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>  3 files changed, 25 insertions(+)
> 


> 
> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask)
> +{
> +	volatile union ixgbe_adv_rx_desc *rxdp;
> +	struct ixgbe_rx_queue *rxq = rx_queue;
> +	uint16_t desc;
> +
> +	desc = rxq->rx_tail;
> +	rxdp = &rxq->rx_ring[desc];
> +	/* watch for changes in status bit */
> +	*tail_desc_addr = &rxdp->wb.upper.status_error;
> +
> +	/*
> +	 * we expect the DD bit to be set to 1 if this descriptor was already
> +	 * written to.
> +	 */
> +	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> +	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> +

Seems have one issue about the byte endian:
Like for BIG endian:
         *expected = rte_bswap32(IXGBE_RXDADV_STAT_DD)
             !=
         *expected = rte_bswap64(IXGBE_RXDADV_STAT_DD)

And in API 'rte_power_monitor', use uint64_t type to access the wake up
data:

static inline void rte_power_monitor(const volatile void *p,
		const uint64_t expected_value, const uint64_t value_mask,
		const uint64_t tsc_timestamp)
{
	if (value_mask) {
			const uint64_t cur_value = *(const volatile uint64_t *)p;
			const uint64_t masked = cur_value & value_mask;
			/* if the masked value is already matching, abort */
			if (masked == expected_value)
				return;
		}


So that we need the wake up address type like 16/32/64b ?

> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure
  2020-10-11 10:07         ` Jerin Jacob
@ 2020-10-12  9:26           ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12  9:26 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Jan Viktorin, Ruifeng Wang, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev,
	David Hunt, Liang Ma, Thomas Monjalon, McDaniel, Timothy,
	Gage Eads, chris.macnamara

On 11-Oct-20 11:07 AM, Jerin Jacob wrote:
> On Fri, Oct 9, 2020 at 9:32 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> Currently, it is not possible to check support for intrinsics that
>> are platform-specific, cannot be abstracted in a generic way, or do not
>> have support on all architectures. The CPUID flags can be used to some
>> extent, but they are only defined for their platform, while intrinsics
>> will be available to all code as they are in generic headers.
>>
>> This patch introduces infrastructure to check support for certain
>> platform-specific intrinsics, and adds support for checking support for
>> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   .../arm/include/rte_power_intrinsics.h        |  8 ++++++
>>   lib/librte_eal/arm/rte_cpuflags.c             |  6 +++++
>>   lib/librte_eal/include/generic/rte_cpuflags.h | 26 +++++++++++++++++++
>>   .../include/generic/rte_power_intrinsics.h    |  8 ++++++
>>   .../ppc/include/rte_power_intrinsics.h        |  8 ++++++
>>   lib/librte_eal/ppc/rte_cpuflags.c             |  6 +++++
>>   lib/librte_eal/rte_eal_version.map            |  1 +
>>   .../x86/include/rte_power_intrinsics.h        |  8 ++++++
>>   lib/librte_eal/x86/rte_cpuflags.c             | 12 +++++++++
>>   9 files changed, 83 insertions(+)
>>
>> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> index 4aad44a0b9..055ec5877a 100644
>> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
>> @@ -17,6 +17,10 @@ extern "C" {
>>   /**
>>    * This function is not supported on ARM.
>>    *
>> + * @warning It is responsibility of the user to check if this function is
>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
>> + *   so may result in an illegal CPU instruction error.
> 
> See below
> 
>> + *
>>    * @param p
>>    *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>>    * @param expected_value
>> @@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
>>   /**
>>    * This function is not supported on ARM.
>>    *
>> + * @warning It is responsibility of the user to check if this function is
>> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
>> + *   so may result in an illegal CPU instruction error.
>> + *
> See below
> 
> This patch looks to me.
> 
> Since rte_power_monitor() API is public API, I think, only in the
> generic header file, you need to have
> these warnings and API documentation rather than repeating everywhere.
> 

Great, will fix in v6 so. Thanks!

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  8:09         ` Wang, Haiyue
@ 2020-10-12  9:28           ` Burakov, Anatoly
  2020-10-13  1:17             ` Wang, Haiyue
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12  9:28 UTC (permalink / raw)
  To: Wang, Haiyue, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara, Chris

On 12-Oct-20 9:09 AM, Wang, Haiyue wrote:
> Hi Liang,
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Saturday, October 10, 2020 00:02
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
>> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
>> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Implement support for the power management API by implementing a
>> `get_wake_addr` function that will return an address of an RX ring's
>> status bit.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> ---
>>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>>   3 files changed, 25 insertions(+)
>>
> 
> 
>>
>> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
>> +uint64_t *expected, uint64_t *mask)
>> +{
>> +volatile union ixgbe_adv_rx_desc *rxdp;
>> +struct ixgbe_rx_queue *rxq = rx_queue;
>> +uint16_t desc;
>> +
>> +desc = rxq->rx_tail;
>> +rxdp = &rxq->rx_ring[desc];
>> +/* watch for changes in status bit */
>> +*tail_desc_addr = &rxdp->wb.upper.status_error;
>> +
>> +/*
>> + * we expect the DD bit to be set to 1 if this descriptor was already
>> + * written to.
>> + */
>> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +
> 
> Seems have one issue about the byte endian:
> Like for BIG endian:
>           *expected = rte_bswap32(IXGBE_RXDADV_STAT_DD)
>               !=
>           *expected = rte_bswap64(IXGBE_RXDADV_STAT_DD)
> 
> And in API 'rte_power_monitor', use uint64_t type to access the wake up
> data:
> 
> static inline void rte_power_monitor(const volatile void *p,
> const uint64_t expected_value, const uint64_t value_mask,
> const uint64_t tsc_timestamp)
> {
> if (value_mask) {
> const uint64_t cur_value = *(const volatile uint64_t *)p;
> const uint64_t masked = cur_value & value_mask;
> /* if the masked value is already matching, abort */
> if (masked == expected_value)
> return;
> }
> 
> 
> So that we need the wake up address type like 16/32/64b ?

Endian differences strike again! You're right of course.

I suspect casting everything to CPU endinanness would fix it, would it not?

> 
>> --
>> 2.17.1


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  7:46         ` Wang, Haiyue
@ 2020-10-12  9:28           ` Burakov, Anatoly
  2020-10-12  9:44           ` Burakov, Anatoly
  1 sibling, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12  9:28 UTC (permalink / raw)
  To: Wang, Haiyue, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara, Chris

On 12-Oct-20 8:46 AM, Wang, Haiyue wrote:
> Hi Liang,
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Saturday, October 10, 2020 00:02
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
>> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
>> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Implement support for the power management API by implementing a
>> `get_wake_addr` function that will return an address of an RX ring's
>> status bit.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> ---
>>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
>> index 0b98e210e7..30b3f416d4 100644
>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>> @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
>>   .udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
>>   .tm_ops_get           = ixgbe_tm_ops_get,
>>   .tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
>> +.get_wake_addr        = ixgbe_get_wake_addr,
>>   };
>>
>>   /*
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index 977ecf5137..7a9fd2aec6 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1366,6 +1366,28 @@ const uint32_t
>>   RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
>>   };
>>
>> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
>> +uint64_t *expected, uint64_t *mask)
>> +{
>> +volatile union ixgbe_adv_rx_desc *rxdp;
>> +struct ixgbe_rx_queue *rxq = rx_queue;
>> +uint16_t desc;
>> +
>> +desc = rxq->rx_tail;
>> +rxdp = &rxq->rx_ring[desc];
>> +/* watch for changes in status bit */
>> +*tail_desc_addr = &rxdp->wb.upper.status_error;
>> +
>> +/*
>> + * we expect the DD bit to be set to 1 if this descriptor was already
>> + * written to.
>> + */
>> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +
>> +return 0;
>> +}
>> +
> 
> I'm wondering that whether the '.get_wake_addr' can be specific to
> like 'rxq_tailq_addr_get' ? So that one day this wake up mechanism
> can be applied to 'txq_tailq_addr_get' ? :-)
> 
> Also, "volatile void **tail_desc_addr, uint64_t *expected, uint64_t *mask"
> can be merged into 'struct xxx' ? So that you can expand the API easily.
> 
> Just my thoughts.
> 
> Anyway, LGTM
> 
> Acked-by: Haiyue Wang <haiyue.wang@intel.com>

Great point, will think of how to address it.

> 
>> --
>> 2.17.1


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  7:46         ` Wang, Haiyue
  2020-10-12  9:28           ` Burakov, Anatoly
@ 2020-10-12  9:44           ` Burakov, Anatoly
  2020-10-12 15:58             ` Wang, Haiyue
  1 sibling, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12  9:44 UTC (permalink / raw)
  To: Wang, Haiyue, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara, Chris

On 12-Oct-20 8:46 AM, Wang, Haiyue wrote:
> Hi Liang,
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Saturday, October 10, 2020 00:02
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
>> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
>> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Implement support for the power management API by implementing a
>> `get_wake_addr` function that will return an address of an RX ring's
>> status bit.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> ---
>>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
>>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
>>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
>> index 0b98e210e7..30b3f416d4 100644
>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>> @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
>>   .udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
>>   .tm_ops_get           = ixgbe_tm_ops_get,
>>   .tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
>> +.get_wake_addr        = ixgbe_get_wake_addr,
>>   };
>>
>>   /*
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index 977ecf5137..7a9fd2aec6 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1366,6 +1366,28 @@ const uint32_t
>>   RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
>>   };
>>
>> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
>> +uint64_t *expected, uint64_t *mask)
>> +{
>> +volatile union ixgbe_adv_rx_desc *rxdp;
>> +struct ixgbe_rx_queue *rxq = rx_queue;
>> +uint16_t desc;
>> +
>> +desc = rxq->rx_tail;
>> +rxdp = &rxq->rx_ring[desc];
>> +/* watch for changes in status bit */
>> +*tail_desc_addr = &rxdp->wb.upper.status_error;
>> +
>> +/*
>> + * we expect the DD bit to be set to 1 if this descriptor was already
>> + * written to.
>> + */
>> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
>> +
>> +return 0;
>> +}
>> +
> 
> I'm wondering that whether the '.get_wake_addr' can be specific to
> like 'rxq_tailq_addr_get' ? So that one day this wake up mechanism
> can be applied to 'txq_tailq_addr_get' ? :-)

What would be the point of sleeping on TX queue though?

> 
> Also, "volatile void **tail_desc_addr, uint64_t *expected, uint64_t *mask"
> can be merged into 'struct xxx' ? So that you can expand the API easily.

Actually, i don't think we can do that. Well, we can, but we'll have to 
either define a new struct for ethdev, or define it in the power library 
and make ethdev dependent on the power library. The latter is a no-go, 
and the former i don't think is a good idea because adding a new struct 
to ethdev is big deal and i'd like to avoid that if i can.

> 
> Just my thoughts.
> 
> Anyway, LGTM
> 
> Acked-by: Haiyue Wang <haiyue.wang@intel.com>
> 
>> --
>> 2.17.1


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-10 13:19                     ` Ananyev, Konstantin
@ 2020-10-12 10:35                       ` Burakov, Anatoly
  2020-10-12 10:36                         ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12 10:35 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 10-Oct-20 2:19 PM, Ananyev, Konstantin wrote:
> 
> 
>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>> compiler support for these instructions.
>>>>>>>>
>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>> happened, and no more writes are expected.
>>>>>>>
>>>>>>> I think what this API is missing - a function to wakeup sleeping core.
>>>>>>> If user can/should use some system call to achieve that, then at least
>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>
>>>>>> I don't think it's possible to do that without severely overcomplicating
>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>> sleeping core would be to send some kind of interrupt to the core, or
>>>>>> trigger a write to the cache-line in question.
>>>>>>
>>>>>
>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
>>>>> or something hand-made. For hand-made, I wonder would something like that
>>>>> be safe and sufficient:
>>>>> uint64_t val = atomic_load(addr);
>>>>> CAS(addr, val, &val);
>>>>> ?
>>>>> Anyway, one way or another - I think ability to wakeup core we put to sleep
>>>>> have to be an essential part of this feature.
>>>>> As I understand linux kernel will limit max amount of sleep time for these instructions:
>>>>> https://lwn.net/Articles/790920/
>>>>> But relying just on that, seems too vague for me:
>>>>> - user can adjust that value
>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>> Konstantin
>>>>>
>>>>
>>>> This implies knowing the value the core is sleeping on.
>>>
>>> You don't the value to wait for, you just need an address.
>>> And you can make wakeup function to accept address as a parameter,
>>> same as monitor() does.
>>
>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>
>>>
>>>> That's not
>>>> always the case - with this particular PMD power management scheme, we
>>>> get the address from the PMD and it stays inside the callback.
>>>
>>> That's fine - you can store address inside you callback metadata
>>> and do wakeup as part of _disable_ function.
>>>
>>
>> The address may be different, and by the time we access the address it
>> may become stale, so i don't see how that would help unless you're
>> suggesting to have some kind of synchronization mechanism there.
> 
> Yes, we'll need something to sync here for sure.
> Sorry, I should say it straightway, to avoid further misunderstanding.
> Let say, associate a spin_lock with monitor(), by analogy with pthread_cond_wait().
> Konstantin
> 

The idea was to provide an intrinsic-like function - as in, raw 
instruction call, without anything extra. We even added the masks/values 
etc. only because there's no race-less way to combine UMONITOR/UMWAIT 
without those.

Perhaps we can provide a synchronize-able wrapper around it to avoid 
adding overhead to calls that function but doesn't need the sync mechanism?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-12 10:35                       ` Burakov, Anatoly
@ 2020-10-12 10:36                         ` Burakov, Anatoly
  2020-10-12 12:50                           ` Ananyev, Konstantin
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12 10:36 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 12-Oct-20 11:35 AM, Burakov, Anatoly wrote:
> On 10-Oct-20 2:19 PM, Ananyev, Konstantin wrote:
>>
>>
>>>>>>>>> Add two new power management intrinsics, and provide an 
>>>>>>>>> implementation
>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>> are implemented as raw byte opcodes because there is not yet 
>>>>>>>>> widespread
>>>>>>>>> compiler support for these instructions.
>>>>>>>>>
>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>> function to either wait until a specified TSC timestamp is 
>>>>>>>>> reached, or
>>>>>>>>> optionally wait until either a TSC timestamp is reached or a 
>>>>>>>>> memory
>>>>>>>>> location is written to. The monitor function also provides an 
>>>>>>>>> optional
>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>
>>>>>>>> I think what this API is missing - a function to wakeup sleeping 
>>>>>>>> core.
>>>>>>>> If user can/should use some system call to achieve that, then at 
>>>>>>>> least
>>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>>
>>>>>>> I don't think it's possible to do that without severely 
>>>>>>> overcomplicating
>>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>>> sleeping core would be to send some kind of interrupt to the 
>>>>>>> core, or
>>>>>>> trigger a write to the cache-line in question.
>>>>>>>
>>>>>>
>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>> (on top of my head - membarrier() does that, might be there are 
>>>>>> some other syscalls too),
>>>>>> or something hand-made. For hand-made, I wonder would something 
>>>>>> like that
>>>>>> be safe and sufficient:
>>>>>> uint64_t val = atomic_load(addr);
>>>>>> CAS(addr, val, &val);
>>>>>> ?
>>>>>> Anyway, one way or another - I think ability to wakeup core we put 
>>>>>> to sleep
>>>>>> have to be an essential part of this feature.
>>>>>> As I understand linux kernel will limit max amount of sleep time 
>>>>>> for these instructions:
>>>>>> https://lwn.net/Articles/790920/
>>>>>> But relying just on that, seems too vague for me:
>>>>>> - user can adjust that value
>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>> Konstantin
>>>>>>
>>>>>
>>>>> This implies knowing the value the core is sleeping on.
>>>>
>>>> You don't the value to wait for, you just need an address.
>>>> And you can make wakeup function to accept address as a parameter,
>>>> same as monitor() does.
>>>
>>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>>
>>>>
>>>>> That's not
>>>>> always the case - with this particular PMD power management scheme, we
>>>>> get the address from the PMD and it stays inside the callback.
>>>>
>>>> That's fine - you can store address inside you callback metadata
>>>> and do wakeup as part of _disable_ function.
>>>>
>>>
>>> The address may be different, and by the time we access the address it
>>> may become stale, so i don't see how that would help unless you're
>>> suggesting to have some kind of synchronization mechanism there.
>>
>> Yes, we'll need something to sync here for sure.
>> Sorry, I should say it straightway, to avoid further misunderstanding.
>> Let say, associate a spin_lock with monitor(), by analogy with 
>> pthread_cond_wait().
>> Konstantin
>>
> 
> The idea was to provide an intrinsic-like function - as in, raw 
> instruction call, without anything extra. We even added the masks/values 
> etc. only because there's no race-less way to combine UMONITOR/UMWAIT 
> without those.
> 
> Perhaps we can provide a synchronize-able wrapper around it to avoid 
> adding overhead to calls that function but doesn't need the sync mechanism?
> 

Also, how would having a spinlock help to synchronize? Are you 
suggesting we do UMWAIT on a spinlock address, or something to that effect?

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-12 10:36                         ` Burakov, Anatoly
@ 2020-10-12 12:50                           ` Ananyev, Konstantin
  2020-10-12 13:13                             ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-10-12 12:50 UTC (permalink / raw)
  To: Burakov, Anatoly, Ma, Liang J, dev; +Cc: Hunt, David, stephen


> >>
> >>>>>>>>> Add two new power management intrinsics, and provide an
> >>>>>>>>> implementation
> >>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>> are implemented as raw byte opcodes because there is not yet
> >>>>>>>>> widespread
> >>>>>>>>> compiler support for these instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is
> >>>>>>>>> reached, or
> >>>>>>>>> optionally wait until either a TSC timestamp is reached or a
> >>>>>>>>> memory
> >>>>>>>>> location is written to. The monitor function also provides an
> >>>>>>>>> optional
> >>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>
> >>>>>>>> I think what this API is missing - a function to wakeup sleeping
> >>>>>>>> core.
> >>>>>>>> If user can/should use some system call to achieve that, then at
> >>>>>>>> least
> >>>>>>>> it has to be clearly documented, even better some wrapper provided.
> >>>>>>>
> >>>>>>> I don't think it's possible to do that without severely
> >>>>>>> overcomplicating
> >>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
> >>>>>>> sleeping core would be to send some kind of interrupt to the
> >>>>>>> core, or
> >>>>>>> trigger a write to the cache-line in question.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, I think we either need a syscall that would do an IPI for us
> >>>>>> (on top of my head - membarrier() does that, might be there are
> >>>>>> some other syscalls too),
> >>>>>> or something hand-made. For hand-made, I wonder would something
> >>>>>> like that
> >>>>>> be safe and sufficient:
> >>>>>> uint64_t val = atomic_load(addr);
> >>>>>> CAS(addr, val, &val);
> >>>>>> ?
> >>>>>> Anyway, one way or another - I think ability to wakeup core we put
> >>>>>> to sleep
> >>>>>> have to be an essential part of this feature.
> >>>>>> As I understand linux kernel will limit max amount of sleep time
> >>>>>> for these instructions:
> >>>>>> https://lwn.net/Articles/790920/
> >>>>>> But relying just on that, seems too vague for me:
> >>>>>> - user can adjust that value
> >>>>>> - wouldn't apply to older kernels and non-linux cases
> >>>>>> Konstantin
> >>>>>>
> >>>>>
> >>>>> This implies knowing the value the core is sleeping on.
> >>>>
> >>>> You don't the value to wait for, you just need an address.
> >>>> And you can make wakeup function to accept address as a parameter,
> >>>> same as monitor() does.
> >>>
> >>> Sorry, i meant the address. We don't know the address we're sleeping on.
> >>>
> >>>>
> >>>>> That's not
> >>>>> always the case - with this particular PMD power management scheme, we
> >>>>> get the address from the PMD and it stays inside the callback.
> >>>>
> >>>> That's fine - you can store address inside you callback metadata
> >>>> and do wakeup as part of _disable_ function.
> >>>>
> >>>
> >>> The address may be different, and by the time we access the address it
> >>> may become stale, so i don't see how that would help unless you're
> >>> suggesting to have some kind of synchronization mechanism there.
> >>
> >> Yes, we'll need something to sync here for sure.
> >> Sorry, I should say it straightway, to avoid further misunderstanding.
> >> Let say, associate a spin_lock with monitor(), by analogy with
> >> pthread_cond_wait().
> >> Konstantin
> >>
> >
> > The idea was to provide an intrinsic-like function - as in, raw
> > instruction call, without anything extra. We even added the masks/values
> > etc. only because there's no race-less way to combine UMONITOR/UMWAIT
> > without those.
>>
> > Perhaps we can provide a synchronize-able wrapper around it to avoid
> > adding overhead to calls that function but doesn't need the sync mechanism?

Yes, might be two flavours, something like
rte_power_monitor() and rte_power_monitor_sync() 
or whatever would be a better name.

> >
> 
> Also, how would having a spinlock help to synchronize? Are you
> suggesting we do UMWAIT on a spinlock address, or something to that effect?
> 

I thought about something very similar to cond_wait() working model:

/*
 * Caller has to obtain lock before calling that function.
 */
static inline int rte_power_monitor_sync(const volatile void *p,
                const uint64_t expected_value, const uint64_t value_mask,
                const uint32_t state, const uint64_t tsc_timestamp, rte_spinlock_t *lck)
{
	/* do whatever preparations are needed */
               ....
	umonitor(p);

	if (value_mask != 0 && *((const uint64_t *)p) & value_mask == expected_value) {
		return 0;
 	}
	
	/* release lock and go to sleep */
	rte_spinlock_unlock(lck);
	rflags = umwait();

	/* grab lock back after wakeup */
	rte_spinlock_lock(lck);

	/* do rest of processing */
	....
}

/* similar go cond_signal */
static inline void rte_power_monitor_wakeup(volatile void *p)
{
	uint64_t v;

	v = __atomic_load_n(p, __ATOMIC_RELAXED);
	__atomic_compare_exchange_n(p, v, &v, 1, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}               


Now in librte_power:

struct pmd_queue_cfg {
       /* to protect state and wait_addr */
       rte_spinlock_t lck;
       enum pmd_mgmt_state pwr_mgmt_state;
       void *wait_addr;
       /* rest fields */
      ....
} __rte_cache_aligned;


static uint16_t
rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
                struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
                uint16_t max_pkts __rte_unused, void *_  __rte_unused)
{

        struct pmd_queue_cfg *q_conf;
        q_conf = &port_cfg[port_id].queue_cfg[qidx];

        if (unlikely(nb_rx == 0)) {
                q_conf->empty_poll_stats++;
                if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
                        volatile void *target_addr;
                        uint64_t expected, mask;
                        uint16_t ret;
		
	         /* grab the lock and check the state */
                       rte_spinlock_lock(&q_conf->lck);
	         If (q-conf->state == ENABLED) {
	                        ret = rte_eth_get_wake_addr(port_id, qidx,
                                                    &target_addr, &expected, &mask);
		          If (ret == 0) {
			q_conf->wait_addr = target_addr;
			rte_power_monitor(target_addr, ..., &q_conf->lck);
		         }	
		          /* reset the wait_addr */
		          q_conf->wait_addr = NULL;
	         }
	         rte_spinlock_unlock(&q_conf->lck);	
	         ....
}

nt
rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
                                uint16_t port_id,
                                uint16_t queue_id)
{
	...
	/* grab the lock and change the state */
               rte_spinlock_lock(&q_conf->lck);
	queue_cfg->state = DISABLED;

	/* wakeup if necessary */
	If (queue_cfg->wakeup_addr != NULL)
		rte_power_monitor_wakeup(queue_cfg->wakeup_addr);

	rte_spinlock_unlock(&q_conf->lck);
	...
}

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-12 12:50                           ` Ananyev, Konstantin
@ 2020-10-12 13:13                             ` Burakov, Anatoly
  2020-10-13  9:45                               ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-12 13:13 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
> 
>>>>
>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>> implementation
>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet
>>>>>>>>>>> widespread
>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>
>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>> function to either wait until a specified TSC timestamp is
>>>>>>>>>>> reached, or
>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a
>>>>>>>>>>> memory
>>>>>>>>>>> location is written to. The monitor function also provides an
>>>>>>>>>>> optional
>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>
>>>>>>>>>> I think what this API is missing - a function to wakeup sleeping
>>>>>>>>>> core.
>>>>>>>>>> If user can/should use some system call to achieve that, then at
>>>>>>>>>> least
>>>>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>>>>
>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>> overcomplicating
>>>>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>>>>> sleeping core would be to send some kind of interrupt to the
>>>>>>>>> core, or
>>>>>>>>> trigger a write to the cache-line in question.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>>>> (on top of my head - membarrier() does that, might be there are
>>>>>>>> some other syscalls too),
>>>>>>>> or something hand-made. For hand-made, I wonder would something
>>>>>>>> like that
>>>>>>>> be safe and sufficient:
>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>> CAS(addr, val, &val);
>>>>>>>> ?
>>>>>>>> Anyway, one way or another - I think ability to wakeup core we put
>>>>>>>> to sleep
>>>>>>>> have to be an essential part of this feature.
>>>>>>>> As I understand linux kernel will limit max amount of sleep time
>>>>>>>> for these instructions:
>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>> But relying just on that, seems too vague for me:
>>>>>>>> - user can adjust that value
>>>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>>>> Konstantin
>>>>>>>>
>>>>>>>
>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>
>>>>>> You don't the value to wait for, you just need an address.
>>>>>> And you can make wakeup function to accept address as a parameter,
>>>>>> same as monitor() does.
>>>>>
>>>>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>>>>
>>>>>>
>>>>>>> That's not
>>>>>>> always the case - with this particular PMD power management scheme, we
>>>>>>> get the address from the PMD and it stays inside the callback.
>>>>>>
>>>>>> That's fine - you can store address inside you callback metadata
>>>>>> and do wakeup as part of _disable_ function.
>>>>>>
>>>>>
>>>>> The address may be different, and by the time we access the address it
>>>>> may become stale, so i don't see how that would help unless you're
>>>>> suggesting to have some kind of synchronization mechanism there.
>>>>
>>>> Yes, we'll need something to sync here for sure.
>>>> Sorry, I should say it straightway, to avoid further misunderstanding.
>>>> Let say, associate a spin_lock with monitor(), by analogy with
>>>> pthread_cond_wait().
>>>> Konstantin
>>>>
>>>
>>> The idea was to provide an intrinsic-like function - as in, raw
>>> instruction call, without anything extra. We even added the masks/values
>>> etc. only because there's no race-less way to combine UMONITOR/UMWAIT
>>> without those.
>>>
>>> Perhaps we can provide a synchronize-able wrapper around it to avoid
>>> adding overhead to calls that function but doesn't need the sync mechanism?
> 
> Yes, might be two flavours, something like
> rte_power_monitor() and rte_power_monitor_sync()
> or whatever would be a better name.
> 
>>>
>>
>> Also, how would having a spinlock help to synchronize? Are you
>> suggesting we do UMWAIT on a spinlock address, or something to that effect?
>>
> 
> I thought about something very similar to cond_wait() working model:
> 
> /*
>   * Caller has to obtain lock before calling that function.
>   */
> static inline int rte_power_monitor_sync(const volatile void *p,
>                  const uint64_t expected_value, const uint64_t value_mask,
>                  const uint32_t state, const uint64_t tsc_timestamp, rte_spinlock_t *lck)
> {
> /* do whatever preparations are needed */
>                 ....
> umonitor(p);
> 
> if (value_mask != 0 && *((const uint64_t *)p) & value_mask == expected_value) {
> return 0;
>   }
> 
> /* release lock and go to sleep */
> rte_spinlock_unlock(lck);
> rflags = umwait();
> 
> /* grab lock back after wakeup */
> rte_spinlock_lock(lck);
> 
> /* do rest of processing */
> ....
> }
> 
> /* similar go cond_signal */
> static inline void rte_power_monitor_wakeup(volatile void *p)
> {
> uint64_t v;
> 
> v = __atomic_load_n(p, __ATOMIC_RELAXED);
> __atomic_compare_exchange_n(p, v, &v, 1, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
> }
> 
> 
> Now in librte_power:
> 
> struct pmd_queue_cfg {
>         /* to protect state and wait_addr */
>         rte_spinlock_t lck;
>         enum pmd_mgmt_state pwr_mgmt_state;
>         void *wait_addr;
>         /* rest fields */
>        ....
> } __rte_cache_aligned;
> 
> 
> static uint16_t
> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>                  struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>                  uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> {
> 
>          struct pmd_queue_cfg *q_conf;
>          q_conf = &port_cfg[port_id].queue_cfg[qidx];
> 
>          if (unlikely(nb_rx == 0)) {
>                  q_conf->empty_poll_stats++;
>                  if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>                          volatile void *target_addr;
>                          uint64_t expected, mask;
>                          uint16_t ret;
> 
>           /* grab the lock and check the state */
>                         rte_spinlock_lock(&q_conf->lck);
>           If (q-conf->state == ENABLED) {
>                          ret = rte_eth_get_wake_addr(port_id, qidx,
>                                                      &target_addr, &expected, &mask);
>            If (ret == 0) {
> q_conf->wait_addr = target_addr;
> rte_power_monitor(target_addr, ..., &q_conf->lck);
>           }
>            /* reset the wait_addr */
>            q_conf->wait_addr = NULL;
>           }
>           rte_spinlock_unlock(&q_conf->lck);
>           ....
> }
> 
> nt
> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>                                  uint16_t port_id,
>                                  uint16_t queue_id)
> {
> ...
> /* grab the lock and change the state */
>                 rte_spinlock_lock(&q_conf->lck);
> queue_cfg->state = DISABLED;
> 
> /* wakeup if necessary */
> If (queue_cfg->wakeup_addr != NULL)
> rte_power_monitor_wakeup(queue_cfg->wakeup_addr);
> 
> rte_spinlock_unlock(&q_conf->lck);
> ...
> }
> 

Yeah, seems that i understood you correctly the first time then. I'm not 
completely convinced that this overhead and complexity is worth the 
trouble, to be honest. I mean, it's not like we're going to sleep 
indefinitely, this isn't like pthread wait - the biggest sleep time i've 
seen was around half a second and i'm not sure there is a use case for 
enabling/disabling this functionality willy nilly ever 5 seconds.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  9:44           ` Burakov, Anatoly
@ 2020-10-12 15:58             ` Wang, Haiyue
  0 siblings, 0 replies; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-12 15:58 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Monday, October 12, 2020 17:45
> To: Wang, Haiyue <haiyue.wang@intel.com>; dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Hunt, David
> <david.hunt@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com;
> Richardson, Bruce <bruce.richardson@intel.com>; thomas@monjalon.net; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>; Macnamara, Chris
> <chris.macnamara@intel.com>
> Subject: Re: [PATCH v5 06/10] net/ixgbe: implement power management API
> 
> On 12-Oct-20 8:46 AM, Wang, Haiyue wrote:
> > Hi Liang,
> >
> >> -----Original Message-----
> >> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> >> Sent: Saturday, October 10, 2020 00:02
> >> To: dev@dpdk.org
> >> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> >> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> >> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce
> <bruce.richardson@intel.com>;
> >> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> <gage.eads@intel.com>;
> >> Macnamara, Chris <chris.macnamara@intel.com>
> >> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
> >>
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Implement support for the power management API by implementing a
> >> `get_wake_addr` function that will return an address of an RX ring's
> >> status bit.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> ---
> >>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
> >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
> >>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
> >>   3 files changed, 25 insertions(+)
> >>
> >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
> >> index 0b98e210e7..30b3f416d4 100644
> >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> >> @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
> >>   .udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
> >>   .tm_ops_get           = ixgbe_tm_ops_get,
> >>   .tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
> >> +.get_wake_addr        = ixgbe_get_wake_addr,
> >>   };
> >>
> >>   /*
> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> index 977ecf5137..7a9fd2aec6 100644
> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> @@ -1366,6 +1366,28 @@ const uint32_t
> >>   RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
> >>   };
> >>
> >> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> >> +uint64_t *expected, uint64_t *mask)
> >> +{
> >> +volatile union ixgbe_adv_rx_desc *rxdp;
> >> +struct ixgbe_rx_queue *rxq = rx_queue;
> >> +uint16_t desc;
> >> +
> >> +desc = rxq->rx_tail;
> >> +rxdp = &rxq->rx_ring[desc];
> >> +/* watch for changes in status bit */
> >> +*tail_desc_addr = &rxdp->wb.upper.status_error;
> >> +
> >> +/*
> >> + * we expect the DD bit to be set to 1 if this descriptor was already
> >> + * written to.
> >> + */
> >> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> >> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> >> +
> >> +return 0;
> >> +}
> >> +
> >
> > I'm wondering that whether the '.get_wake_addr' can be specific to
> > like 'rxq_tailq_addr_get' ? So that one day this wake up mechanism
> > can be applied to 'txq_tailq_addr_get' ? :-)
> 
> What would be the point of sleeping on TX queue though?

I checked, seems that the PMD uses internal index, no address, please ignore
this bad idea. ;-)

> 
> >
> > Also, "volatile void **tail_desc_addr, uint64_t *expected, uint64_t *mask"
> > can be merged into 'struct xxx' ? So that you can expand the API easily.
> 
> Actually, i don't think we can do that. Well, we can, but we'll have to
> either define a new struct for ethdev, or define it in the power library
> and make ethdev dependent on the power library. The latter is a no-go,
> and the former i don't think is a good idea because adding a new struct
> to ethdev is big deal and i'd like to avoid that if i can.

Understood the design now, thanks!

> 
> >
> > Just my thoughts.
> >
> > Anyway, LGTM
> >
> > Acked-by: Haiyue Wang <haiyue.wang@intel.com>
> >
> >> --
> >> 2.17.1
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 02/10] eal: add power management intrinsics Anatoly Burakov
  2020-10-09 16:09         ` Jerin Jacob
@ 2020-10-12 19:47         ` David Christensen
  1 sibling, 0 replies; 421+ messages in thread
From: David Christensen @ 2020-10-12 19:47 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Liang Ma, Jan Viktorin, Ruifeng Wang, Bruce Richardson,
	Konstantin Ananyev, david.hunt, jerinjacobk, thomas,
	timothy.mcdaniel, gage.eads, chris.macnamara



On 10/9/20 9:02 AM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> For more details, please refer to Intel(R) 64 and IA-32 Architectures
> Software Developer's Manual, Volume 2.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>      v5:
>      - Removed return values
>      - Simplified intrinsics and hardcoded C0.2 state
>      - Added other arch stubs
> 

... snip ...

> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_PPC_H_
> +#define _RTE_POWER_INTRINSIC_PPC_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * This function is not supported on PPC64.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 0 on success
> + */
> +static inline void rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint64_t tsc_timestamp)
> +{
> +	RTE_SET_USED(p);
> +	RTE_SET_USED(expected_value);
> +	RTE_SET_USED(value_mask);
> +	RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on PPC64.
> + *
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline void rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +	RTE_SET_USED(tsc_timestamp);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_PPC_H_ */

I didn't find an equivalent instruction in the current 3.1 ISA, so not 
supported is correct for POWER.

Acked-by: David Christensen <drc@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 03/10] eal: add intrinsics support check infrastructure Anatoly Burakov
  2020-10-11 10:07         ` Jerin Jacob
@ 2020-10-12 19:52         ` David Christensen
  1 sibling, 0 replies; 421+ messages in thread
From: David Christensen @ 2020-10-12 19:52 UTC (permalink / raw)
  To: Anatoly Burakov, dev
  Cc: Jan Viktorin, Ruifeng Wang, Ray Kinsella, Neil Horman,
	Bruce Richardson, Konstantin Ananyev, david.hunt, liang.j.ma,
	jerinjacobk, thomas, timothy.mcdaniel, gage.eads,
	chris.macnamara



On 10/9/20 9:02 AM, Anatoly Burakov wrote:
> Currently, it is not possible to check support for intrinsics that
> are platform-specific, cannot be abstracted in a generic way, or do not
> have support on all architectures. The CPUID flags can be used to some
> extent, but they are only defined for their platform, while intrinsics
> will be available to all code as they are in generic headers.
> 
> This patch introduces infrastructure to check support for certain
> platform-specific intrinsics, and adds support for checking support for
> IA power management-related intrinsics for UMWAIT/UMONITOR and TPAUSE.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

... snip ...

> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> index 70fd7b094f..d63ad86849 100644
> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h
> @@ -17,6 +17,10 @@ extern "C" {
>   /**
>    * This function is not supported on PPC64.
>    *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>    * @param p
>    *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
>    * @param expected_value
> @@ -43,6 +47,10 @@ static inline void rte_power_monitor(const volatile void *p,
>   /**
>    * This function is not supported on PPC64.
>    *
> + * @warning It is responsibility of the user to check if this function is
> + *   supported at runtime using `rte_cpu_get_features()` API call. Failing to do
> + *   so may result in an illegal CPU instruction error.
> + *
>    * @param tsc_timestamp
>    *   Maximum TSC timestamp to wait for.
>    *
> diff --git a/lib/librte_eal/ppc/rte_cpuflags.c b/lib/librte_eal/ppc/rte_cpuflags.c
> index 3bb7563ce9..eee8234384 100644
> --- a/lib/librte_eal/ppc/rte_cpuflags.c
> +++ b/lib/librte_eal/ppc/rte_cpuflags.c
> @@ -108,3 +108,9 @@ rte_cpu_get_flag_name(enum rte_cpu_flag_t feature)
>   		return NULL;
>   	return rte_cpu_feature_table[feature].name;
>   }
> +
> +void
> +rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
> +{
> +	memset(intrinsics, 0, sizeof(*intrinsics));
> +}
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index a93dea9fe6..ed944f2bd4 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -400,6 +400,7 @@ EXPERIMENTAL {
>   	# added in 20.11
>   	__rte_eal_trace_generic_size_t;
>   	rte_service_lcore_may_be_active;
> +	rte_cpu_get_intrinsics_support;
>   };
> 
>   INTERNAL {

Acked-by: David Christensen <drc@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 06/10] net/ixgbe: implement power management API
  2020-10-12  9:28           ` Burakov, Anatoly
@ 2020-10-13  1:17             ` Wang, Haiyue
  0 siblings, 0 replies; 421+ messages in thread
From: Wang, Haiyue @ 2020-10-13  1:17 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Guo, Jia, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Monday, October 12, 2020 17:29
> To: Wang, Haiyue <haiyue.wang@intel.com>; dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Hunt, David
> <david.hunt@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com;
> Richardson, Bruce <bruce.richardson@intel.com>; thomas@monjalon.net; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>; Macnamara, Chris
> <chris.macnamara@intel.com>
> Subject: Re: [PATCH v5 06/10] net/ixgbe: implement power management API
> 
> On 12-Oct-20 9:09 AM, Wang, Haiyue wrote:
> > Hi Liang,
> >
> >> -----Original Message-----
> >> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> >> Sent: Saturday, October 10, 2020 00:02
> >> To: dev@dpdk.org
> >> Cc: Ma, Liang J <liang.j.ma@intel.com>; Guo, Jia <jia.guo@intel.com>; Wang, Haiyue
> >> <haiyue.wang@intel.com>; Hunt, David <david.hunt@intel.com>; Ananyev, Konstantin
> >> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson, Bruce
> <bruce.richardson@intel.com>;
> >> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Eads, Gage
> <gage.eads@intel.com>;
> >> Macnamara, Chris <chris.macnamara@intel.com>
> >> Subject: [PATCH v5 06/10] net/ixgbe: implement power management API
> >>
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Implement support for the power management API by implementing a
> >> `get_wake_addr` function that will return an address of an RX ring's
> >> status bit.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> ---
> >>   drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
> >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
> >>   drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
> >>   3 files changed, 25 insertions(+)
> >>
> >
> >
> >>
> >> +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> >> +uint64_t *expected, uint64_t *mask)
> >> +{
> >> +volatile union ixgbe_adv_rx_desc *rxdp;
> >> +struct ixgbe_rx_queue *rxq = rx_queue;
> >> +uint16_t desc;
> >> +
> >> +desc = rxq->rx_tail;
> >> +rxdp = &rxq->rx_ring[desc];
> >> +/* watch for changes in status bit */
> >> +*tail_desc_addr = &rxdp->wb.upper.status_error;
> >> +
> >> +/*
> >> + * we expect the DD bit to be set to 1 if this descriptor was already
> >> + * written to.
> >> + */
> >> +*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> >> +*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> >> +
> >
> > Seems have one issue about the byte endian:
> > Like for BIG endian:
> >           *expected = rte_bswap32(IXGBE_RXDADV_STAT_DD)
> >               !=
> >           *expected = rte_bswap64(IXGBE_RXDADV_STAT_DD)
> >
> > And in API 'rte_power_monitor', use uint64_t type to access the wake up
> > data:
> >
> > static inline void rte_power_monitor(const volatile void *p,
> > const uint64_t expected_value, const uint64_t value_mask,
> > const uint64_t tsc_timestamp)
> > {
> > if (value_mask) {
> > const uint64_t cur_value = *(const volatile uint64_t *)p;
> > const uint64_t masked = cur_value & value_mask;
> > /* if the masked value is already matching, abort */
> > if (masked == expected_value)
> > return;
> > }
> >
> >
> > So that we need the wake up address type like 16/32/64b ?
> 
> Endian differences strike again! You're right of course.
> 
> I suspect casting everything to CPU endinanness would fix it, would it not?

But need the same date type, if swap is needed for casting, then
(u64 a = rte_bswap32(1)) != (u64 b = rte_bswap64(1))

> 
> >
> >> --
> >> 2.17.1
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
  2020-10-12 13:13                             ` Burakov, Anatoly
@ 2020-10-13  9:45                               ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-13  9:45 UTC (permalink / raw)
  To: Ananyev, Konstantin, Ma, Liang J, dev; +Cc: Hunt, David, stephen

On 12-Oct-20 2:13 PM, Burakov, Anatoly wrote:
> On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
>>
>>>>>
>>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>>> implementation
>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The 
>>>>>>>>>>>> instructions
>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet
>>>>>>>>>>>> widespread
>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>
>>>>>>>>>>>> The power management instructions provide an 
>>>>>>>>>>>> architecture-specific
>>>>>>>>>>>> function to either wait until a specified TSC timestamp is
>>>>>>>>>>>> reached, or
>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a
>>>>>>>>>>>> memory
>>>>>>>>>>>> location is written to. The monitor function also provides an
>>>>>>>>>>>> optional
>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has 
>>>>>>>>>>>> already
>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>
>>>>>>>>>>> I think what this API is missing - a function to wakeup sleeping
>>>>>>>>>>> core.
>>>>>>>>>>> If user can/should use some system call to achieve that, then at
>>>>>>>>>>> least
>>>>>>>>>>> it has to be clearly documented, even better some wrapper 
>>>>>>>>>>> provided.
>>>>>>>>>>
>>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>>> overcomplicating
>>>>>>>>>> the intrinsic and its usage, because AFAIK the only way to 
>>>>>>>>>> wake up a
>>>>>>>>>> sleeping core would be to send some kind of interrupt to the
>>>>>>>>>> core, or
>>>>>>>>>> trigger a write to the cache-line in question.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>>>>> (on top of my head - membarrier() does that, might be there are
>>>>>>>>> some other syscalls too),
>>>>>>>>> or something hand-made. For hand-made, I wonder would something
>>>>>>>>> like that
>>>>>>>>> be safe and sufficient:
>>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>>> CAS(addr, val, &val);
>>>>>>>>> ?
>>>>>>>>> Anyway, one way or another - I think ability to wakeup core we put
>>>>>>>>> to sleep
>>>>>>>>> have to be an essential part of this feature.
>>>>>>>>> As I understand linux kernel will limit max amount of sleep time
>>>>>>>>> for these instructions:
>>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>>> But relying just on that, seems too vague for me:
>>>>>>>>> - user can adjust that value
>>>>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>>>>> Konstantin
>>>>>>>>>
>>>>>>>>
>>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>>
>>>>>>> You don't the value to wait for, you just need an address.
>>>>>>> And you can make wakeup function to accept address as a parameter,
>>>>>>> same as monitor() does.
>>>>>>
>>>>>> Sorry, i meant the address. We don't know the address we're 
>>>>>> sleeping on.
>>>>>>
>>>>>>>
>>>>>>>> That's not
>>>>>>>> always the case - with this particular PMD power management 
>>>>>>>> scheme, we
>>>>>>>> get the address from the PMD and it stays inside the callback.
>>>>>>>
>>>>>>> That's fine - you can store address inside you callback metadata
>>>>>>> and do wakeup as part of _disable_ function.
>>>>>>>
>>>>>>
>>>>>> The address may be different, and by the time we access the 
>>>>>> address it
>>>>>> may become stale, so i don't see how that would help unless you're
>>>>>> suggesting to have some kind of synchronization mechanism there.
>>>>>
>>>>> Yes, we'll need something to sync here for sure.
>>>>> Sorry, I should say it straightway, to avoid further misunderstanding.
>>>>> Let say, associate a spin_lock with monitor(), by analogy with
>>>>> pthread_cond_wait().
>>>>> Konstantin
>>>>>
>>>>
>>>> The idea was to provide an intrinsic-like function - as in, raw
>>>> instruction call, without anything extra. We even added the 
>>>> masks/values
>>>> etc. only because there's no race-less way to combine UMONITOR/UMWAIT
>>>> without those.
>>>>
>>>> Perhaps we can provide a synchronize-able wrapper around it to avoid
>>>> adding overhead to calls that function but doesn't need the sync 
>>>> mechanism?
>>
>> Yes, might be two flavours, something like
>> rte_power_monitor() and rte_power_monitor_sync()
>> or whatever would be a better name.
>>
>>>>
>>>
>>> Also, how would having a spinlock help to synchronize? Are you
>>> suggesting we do UMWAIT on a spinlock address, or something to that 
>>> effect?
>>>
>>
>> I thought about something very similar to cond_wait() working model:
>>
>> /*
>>   * Caller has to obtain lock before calling that function.
>>   */
>> static inline int rte_power_monitor_sync(const volatile void *p,
>>                  const uint64_t expected_value, const uint64_t 
>> value_mask,
>>                  const uint32_t state, const uint64_t tsc_timestamp, 
>> rte_spinlock_t *lck)
>> {
>> /* do whatever preparations are needed */
>>                 ....
>> umonitor(p);
>>
>> if (value_mask != 0 && *((const uint64_t *)p) & value_mask == 
>> expected_value) {
>> return 0;
>>   }
>>
>> /* release lock and go to sleep */
>> rte_spinlock_unlock(lck);
>> rflags = umwait();
>>
>> /* grab lock back after wakeup */
>> rte_spinlock_lock(lck);
>>
>> /* do rest of processing */
>> ....
>> }
>>
>> /* similar go cond_signal */
>> static inline void rte_power_monitor_wakeup(volatile void *p)
>> {
>> uint64_t v;
>>
>> v = __atomic_load_n(p, __ATOMIC_RELAXED);
>> __atomic_compare_exchange_n(p, v, &v, 1, __ATOMIC_RELAXED, 
>> __ATOMIC_RELAXED);
>> }
>>
>>
>> Now in librte_power:
>>
>> struct pmd_queue_cfg {
>>         /* to protect state and wait_addr */
>>         rte_spinlock_t lck;
>>         enum pmd_mgmt_state pwr_mgmt_state;
>>         void *wait_addr;
>>         /* rest fields */
>>        ....
>> } __rte_cache_aligned;
>>
>>
>> static uint16_t
>> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>>                  struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>>                  uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> {
>>
>>          struct pmd_queue_cfg *q_conf;
>>          q_conf = &port_cfg[port_id].queue_cfg[qidx];
>>
>>          if (unlikely(nb_rx == 0)) {
>>                  q_conf->empty_poll_stats++;
>>                  if (unlikely(q_conf->empty_poll_stats > 
>> EMPTYPOLL_MAX)) {
>>                          volatile void *target_addr;
>>                          uint64_t expected, mask;
>>                          uint16_t ret;
>>
>>           /* grab the lock and check the state */
>>                         rte_spinlock_lock(&q_conf->lck);
>>           If (q-conf->state == ENABLED) {
>>                          ret = rte_eth_get_wake_addr(port_id, qidx,
>>                                                      &target_addr, 
>> &expected, &mask);
>>            If (ret == 0) {
>> q_conf->wait_addr = target_addr;
>> rte_power_monitor(target_addr, ..., &q_conf->lck);
>>           }
>>            /* reset the wait_addr */
>>            q_conf->wait_addr = NULL;
>>           }
>>           rte_spinlock_unlock(&q_conf->lck);
>>           ....
>> }
>>
>> nt
>> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>>                                  uint16_t port_id,
>>                                  uint16_t queue_id)
>> {
>> ...
>> /* grab the lock and change the state */
>>                 rte_spinlock_lock(&q_conf->lck);
>> queue_cfg->state = DISABLED;
>>
>> /* wakeup if necessary */
>> If (queue_cfg->wakeup_addr != NULL)
>> rte_power_monitor_wakeup(queue_cfg->wakeup_addr);
>>
>> rte_spinlock_unlock(&q_conf->lck);
>> ...
>> }
>>
> 
> Yeah, seems that i understood you correctly the first time then. I'm not 
> completely convinced that this overhead and complexity is worth the 
> trouble, to be honest. I mean, it's not like we're going to sleep 
> indefinitely, this isn't like pthread wait - the biggest sleep time i've 
> seen was around half a second and i'm not sure there is a use case for 
> enabling/disabling this functionality willy nilly ever 5 seconds.
> 

Back story: we've had a little internal chat and basically agreed to 
Konstantin's proposal, with slight modifications. That is, we need to be 
able to wake up the core because otherwise we have no deterministic way 
of stopping the sleeping RX path, however there is no need to expose 
this mechanism in a public API and it can be kept inside the power 
library instead.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API Anatoly Burakov
@ 2020-10-14  3:10         ` Guo, Jia
  2020-10-14  9:07           ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Guo, Jia @ 2020-10-14  3:10 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris


> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Anatoly Burakov
> Sent: Saturday, October 10, 2020 12:02 AM
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
> <david.hunt@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
> Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
> Macnamara, Chris <chris.macnamara@intel.com>
> Subject: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
> management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple API to allow getting address of next RX descriptor from the
> PMD, as well as release notes information.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v5:
>     - Bring function format in line with other functions in the file
>     - Ensure the API is supported by the driver before calling it (Konstantin)
> 
>  doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
>  lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
>  lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_version.map |  1 +
>  5 files changed, 86 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/release_20_11.rst
> b/doc/guides/rel_notes/release_20_11.rst
> index 808bdc4e54..e85af5d3e9 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -55,6 +55,11 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
> 
> +* **ethdev: add 1 new EXPERIMENTAL API for PMD power
> management.**
> +
> +  * ``rte_eth_get_wake_addr()``
> +  * add new eth_dev_ops ``get_wake_addr``
> +
>  * **Updated Broadcom bnxt driver.**
> 
>    Updated the Broadcom bnxt driver with new features and improvements,
> including:
> @@ -136,6 +141,17 @@ New Features
>    * Extern objects and functions can be plugged into the pipeline.
>    * Transaction-oriented table updates.
> 
> +* **Add PMD power management mechanism**
> +
> +  3 new Ethernet PMD power management mechanism is added through

" mechanisms are " please.

> + existing  RX callback infrastructure.
> +
> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
> +  * Add power saving scheme based on ``rte_pause()``
> +  * Add power saving scheme based on frequency scaling through the
> + power library
> +  * Add new EXPERIMENTAL API
> ``rte_power_pmd_mgmt_queue_enable()``
> +  * Add new EXPERIMENTAL API
> ``rte_power_pmd_mgmt_queue_disable()``
> +

Could this doc be separate to other specific patch if it is not related with this patch?

> 
>  Removed Items
>  -------------
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 48d1333b17..352108f43c 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -4804,6 +4804,23 @@ rte_eth_tx_burst_mode_get(uint16_t port_id,
> uint16_t queue_id,
>  		       dev->dev_ops->tx_burst_mode_get(dev, queue_id,
> mode));  }
> 
> +int
> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +		volatile void **wake_addr, uint64_t *expected, uint64_t
> *mask) {
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +
> +	dev = &rte_eth_devices[port_id];
> +
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -
> ENOTSUP);
> +
> +	return eth_err(port_id,
> +		dev->dev_ops->get_wake_addr(dev->data-
> >rx_queues[queue_id],
> +			wake_addr, expected, mask));
> +}
> +
>  int
>  rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>  			     struct rte_ether_addr *mc_addr_set, diff --git
> a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h index
> d2bf74f128..a6cfe3cd57 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -4014,6 +4014,30 @@ __rte_experimental  int
> rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>  	struct rte_eth_burst_mode *mode);
> 
> +/**
> + * Retrieve the wake up address from specific queue
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Tx queue on the Ethernet device for which information
> + *   will be retrieved.
> + * @param wake_addr
> + *   The pointer point to the address which is used for monitoring.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + *
> + * @return
> + *   - 0: Success.
> + *   -EINVAL: Failed to get wake address.
> + */

Is that "-EINVAL " is the only error value which will be return?

> +__rte_experimental
> +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
> +			  volatile void **wake_addr,
> +			  uint64_t *expected, uint64_t *mask);
> +
>  /**
>   * Retrieve device registers and register attributes (number of registers and
>   * register size)
> diff --git a/lib/librte_ethdev/rte_ethdev_driver.h
> b/lib/librte_ethdev/rte_ethdev_driver.h
> index c3062c246c..935d46f25c 100644
> --- a/lib/librte_ethdev/rte_ethdev_driver.h
> +++ b/lib/librte_ethdev/rte_ethdev_driver.h
> @@ -574,6 +574,31 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the Wake up address.
> + *
> + * @param rxq
> + *   Ethdev queue pointer.
> + * @param tail_desc_addr
> + *   The pointer point to descriptor address var.
> + * @param expected
> + *   The pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   The pointer point to comparison bitmask for the expected value.
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */

The question is the same as above.

> +typedef int (*eth_get_wake_addr_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet
> driver.
>   */
> @@ -713,6 +738,9 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_get_wake_addr_t get_wake_addr;
> +	/**< Get wake up address. */
> +
>  };
> 
>  /**
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map
> b/lib/librte_ethdev/rte_ethdev_version.map
> index c95ef5157a..3cb2093980 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -229,6 +229,7 @@ EXPERIMENTAL {
>  	# added in 20.11
>  	rte_eth_link_speed_to_str;
>  	rte_eth_link_to_str;
> +	rte_eth_get_wake_addr;
>  };
> 
>  INTERNAL {
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 07/10] net/i40e: implement power management API
  2020-10-09 16:02       ` [dpdk-dev] [PATCH v5 07/10] net/i40e: " Anatoly Burakov
@ 2020-10-14  3:19         ` Guo, Jia
  2020-10-14  9:08           ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Guo, Jia @ 2020-10-14  3:19 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Xing, Beilei, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, thomas, McDaniel, Timothy, Eads,
	Gage, Macnamara,  Chris


> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Saturday, October 10, 2020 12:02 AM
> To: dev@dpdk.org
> Cc: Ma, Liang J <liang.j.ma@intel.com>; Xing, Beilei <beilei.xing@intel.com>;
> Guo, Jia <jia.guo@intel.com>; Hunt, David <david.hunt@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>;
> thomas@monjalon.net; McDaniel, Timothy <timothy.mcdaniel@intel.com>;
> Eads, Gage <gage.eads@intel.com>; Macnamara, Chris
> <chris.macnamara@intel.com>
> Subject: [PATCH v5 07/10] net/i40e: implement power management API
> 
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Implement support for the power management API by implementing a
> `get_wake_addr` function that will return an address of an RX ring's status bit.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  drivers/net/i40e/i40e_ethdev.c |  1 +
>  drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
>  drivers/net/i40e/i40e_rxtx.h   |  2 ++
>  3 files changed, 26 insertions(+)
> 
> diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
> index 943cfe71dc..cab86f8ec9 100644
> --- a/drivers/net/i40e/i40e_ethdev.c
> +++ b/drivers/net/i40e/i40e_ethdev.c
> @@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
>  	.mtu_set                      = i40e_dev_mtu_set,
>  	.tm_ops_get                   = i40e_tm_ops_get,
>  	.tx_done_cleanup              = i40e_tx_done_cleanup,
> +	.get_wake_addr	              = i40e_get_wake_addr,
>  };
> 
>  /* store statistics names and its offset in stats structure */ diff --git
> a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index
> 322fc1ed75..c17f27292f 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -71,6 +71,29 @@
>  #define I40E_TX_OFFLOAD_NOTSUP_MASK \
>  		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
> 
> +int
> +i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *mask)
> +{
> +	struct i40e_rx_queue *rxq = rx_queue;
> +	volatile union i40e_rx_desc *rxdp;
> +	uint16_t desc;
> +
> +	desc = rxq->rx_tail;
> +	rxdp = &rxq->rx_ring[desc];
> +	/* watch for changes in status bit */
> +	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
> +
> +	/*
> +	 * we expect the DD bit to be set to 1 if this descriptor was already
> +	 * written to.
> +	 */
> +	*expected = rte_cpu_to_le_64(1 <<
> I40E_RX_DESC_STATUS_DD_SHIFT);
> +	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> +
> +	return 0;

Suppose that it will always success to get wake addr in i40e, right?

> +}
> +
>  static inline void
>  i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc
> *rxdp)  { diff --git a/drivers/net/i40e/i40e_rxtx.h
> b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..f23a2073e3 100644
> --- a/drivers/net/i40e/i40e_rxtx.h
> +++ b/drivers/net/i40e/i40e_rxtx.h
> @@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void
> *rx_queue,
>  	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);  uint16_t
> i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
>  	uint16_t nb_pkts);
> +int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
> +		uint64_t *expected, uint64_t *value);
> 
>  /* For each value it means, datasheet of hardware can tell more details
>   *
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread

* Re: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power management API
  2020-10-14  3:10         ` Guo, Jia
@ 2020-10-14  9:07           ` Burakov, Anatoly
  2020-10-14  9:15             ` Guo, Jia
  2020-10-14  9:23             ` Bruce Richardson
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-10-14  9:07 UTC (permalink / raw)
  To: Guo, Jia, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, Hunt, David, Ananyev, Konstantin,
	jerinjacobk, Richardson, Bruce, McDaniel, Timothy, Eads, Gage,
	Macnamara, Chris

On 14-Oct-20 4:10 AM, Guo, Jia wrote:
> 
>> -----Original Message-----
>> From: dev <dev-bounces@dpdk.org> On Behalf Of Anatoly Burakov
>> Sent: Saturday, October 10, 2020 12:02 AM
>> To: dev@dpdk.org
>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Thomas Monjalon
>> <thomas@monjalon.net>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
>> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
>> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Hunt, David
>> <david.hunt@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; jerinjacobk@gmail.com; Richardson,
>> Bruce <bruce.richardson@intel.com>; McDaniel, Timothy
>> <timothy.mcdaniel@intel.com>; Eads, Gage <gage.eads@intel.com>;
>> Macnamara, Chris <chris.macnamara@intel.com>
>> Subject: [dpdk-dev] [PATCH v5 04/10] ethdev: add simple power
>> management API
>>
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add a simple API to allow getting address of next RX descriptor from the
>> PMD, as well as release notes information.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---

Hi Jia,

Thanks for your review. Responses below.

>>
>> Notes:
>>      v5:
>>      - Bring function format in line with other functions in the file
>>      - Ensure the API is supported by the driver before calling it (Konstantin)
>>
>>   doc/guides/rel_notes/release_20_11.rst   | 16 ++++++++++++++
>>   lib/librte_ethdev/rte_ethdev.c           | 17 ++++++++++++++
>>   lib/librte_ethdev/rte_ethdev.h           | 24 ++++++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev_driver.h    | 28 ++++++++++++++++++++++++
>>   lib/librte_ethdev/rte_ethdev_version.map |  1 +
>>   5 files changed, 86 insertions(+)
>>
>> diff --git a/doc/guides/rel_notes/release_20_11.rst
>> b/doc/guides/rel_notes/release_20_11.rst
>> index 808bdc4e54..e85af5d3e9 100644
>> --- a/doc/guides/rel_notes/release_20_11.rst
>> +++ b/doc/guides/rel_notes/release_20_11.rst
>> @@ -55,6 +55,11 @@ New Features
>>        Also, make sure to start the actual text at the margin.
>>        =======================================================
>>
>> +* **ethdev: add 1 new EXPERIMENTAL API for PMD power
>> management.**
>> +
>> +  * ``rte_eth_get_wake_addr()``
>> +  * add new eth_dev_ops ``get_wake_addr``
>> +
>>   * **Updated Broadcom bnxt driver.**
>>
>>     Updated the Broadcom bnxt driver with new features and improvements,
>> including:
>> @@ -136,6 +141,17 @@ New Features
>>     * Extern objects and functions can be plugged into the pipeline.
>>     * Transaction-oriented table updates.
>>
>> +* **Add PMD power management mechanism**
>> +
>> +  3 new Ethernet PMD power management mechanism is added through
> 
> " mechanisms are " please.
> 
>> + existing  RX callback infrastructure.
>> +
>> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
>> +  * Add power saving scheme based on ``rte_pause()``
>> +  * Add power saving scheme based on frequency scaling through the
>> + power library
>> +  * Add new EXPERIMENTAL API
>> ``rte_power_pmd_mgmt_queue_enable()``
>> +  * Add new EXPERIMENTAL API
>> ``rte_power_pmd_mgmt_queue_disable()``
>> +
> 
> Could this doc be separate to other specific patch if it is not related with this patch?

It is related - it's the doc changes that add mention of this API. I was 
under the impression current policy was having doc updates in the same 
patch as the changes made?

> 
>>
>>   Removed Items
>>   -------------
>> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
>> index 48d1333b17..352108f43c 100644
>> --- a/lib/librte_ethdev/rte_ethdev.c
>> +++ b/lib/librte_ethdev/rte_ethdev.c
>> @@ -4804,6 +4804,23 @@ rte_eth_tx_burst_mode_get(uint16_t port_id,
>> uint16_t queue_id,
>>   		       dev->dev_ops->tx_burst_mode_get(dev, queue_id,
>> mode));  }
>>
>> +int
>> +rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
>> +		volatile void **wake_addr, uint64_t *expected, uint64_t
>> *mask) {
>> +	struct rte_eth_dev *dev;
>> +
>> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
>> +
>> +	dev = &rte_eth_devices[port_id];
>> +
>> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -
>> ENOTSUP);
>> +
>> +	return eth_err(port_id,
>> +		dev->dev_ops->get_wake_addr(dev->data-
>>> rx_queues[queue_id],
>> +			wake_addr, expected, mask));
>> +}
>> +
>>   int
>>   rte_eth_dev_set_mc_addr_list(uint16_t port_id,
>>   			     struct rte_ether_addr *mc_addr_set, diff --git
>> a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h index
>> d2bf74f128..a6cfe3cd57 100644
>> --- a/lib/librte_ethdev/rte_ethdev.h
>> +++ b/lib/librte_ethdev/rte_ethdev.h
>> @@ -4014,6 +4014,30 @@ __rte_experimental  int
>> rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>>   	struct rte_eth_burst_mode *mode);
>>
>> +/**
>> + * Retrieve the wake up address from specific queue
>> + *
>> + * @param port_id
>> + *   The port identifier of the Ethernet device.
>> + * @param queue_id
>> + *   The Tx queue on the Ethernet device for which information
>> + *   will be retrieved.
>> + * @param wake_addr
>> + *   The pointer point to the address which is used for monitoring.
>> + * @param expec