DPDK patches and discussions
* [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
@ 2020-05-27 17:02 Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
                   ` (7 more replies)
  0 siblings, 8 replies; 56+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, liang.j.ma

This patchset proposes a simple API for Ethernet drivers
to cause the CPU to enter a power-optimized state while
waiting for packets to arrive, along with a set of
(hopefully generic) intrinsics that facilitate that. This
is achieved through cooperation with the NIC driver, which
tells us the address of the next NIC RX ring packet
descriptor so that we can wait for writes to it.

On IA, this is achieved through using UMONITOR/UMWAIT
instructions. They are used in their raw opcode form
because there is no widespread compiler support for
them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen
to implement similar instructions.

To achieve power savings, there is a very simple mechanism
used: we're counting empty polls, and if a certain threshold
is reached, we get the address of next RX ring descriptor
from the NIC driver, arm the monitoring hardware, and
enter a power-optimized state. We will then wake up when
either a timeout happens, or a write happens (or generally
whenever CPU feels like waking up - this is platform-
specific), and proceed as normal. The empty poll counter is
reset whenever we actually get packets, so we only go to
sleep when we know nothing is going on.
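
The counting logic described above can be sketched in isolation as
follows. This is only an illustration: `struct empty_poll_state`,
`should_sleep()` and `EMPTY_POLL_THRESHOLD` are hypothetical names, and
the threshold value mirrors ETH_EMPTYPOLL_MAX from patch 2/6.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; patch 2/6 uses ETH_EMPTYPOLL_MAX (512). */
#define EMPTY_POLL_THRESHOLD 512

/* Hypothetical per-queue state, for illustration only. */
struct empty_poll_state {
	uint64_t empty_polls;
};

/*
 * Update the counter after one RX burst and report whether the caller
 * should now arm the monitor and enter the power-optimized state.
 */
static bool
should_sleep(struct empty_poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

A caller would invoke `should_sleep()` once per RX burst and only ask
the driver for the descriptor address when it returns true.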

Why are we putting it into ethdev as opposed to leaving
this up to the application? Our customers specifically
requested a way to do it with minimal changes to the
application code. The current approach lets the user just
flip a switch and automagically get power savings.

There are certain limitations in this patchset right now:
- Currently, only 1:1 core to queue mapping is supported,
  meaning that each lcore must at most handle RX on a
  single queue
- Currently, power management is enabled per-port, not
  per-queue
- There is potential to greatly increase TX latency if we
  are buffering things, and go to sleep before sending
  packets
- The API is not perfect and could use some improvement
  and discussion
- The API doesn't extend to other device types
- The intrinsics are platform-specific, so ethdev has
  some platform-specific code in it
- Support was only implemented for devices using
  net/ixgbe, net/i40e and net/ice drivers

Hopefully this will generate enough feedback to clear
a path forward!

Anatoly Burakov (6):
  eal: add power management intrinsics
  ethdev: add simple power management API
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  app/testpmd: add command for power management on a port

 app/test-pmd/cmdline.c                        |  48 +++++++
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  23 +++
 drivers/net/i40e/i40e_rxtx.h                  |   2 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  23 +++
 drivers/net/ice/ice_rxtx.h                    |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  22 +++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 lib/librte_ethdev/rte_ethdev.c                |  39 +++++
 lib/librte_ethdev/rte_ethdev.h                |  70 +++++++++
 lib/librte_ethdev/rte_ethdev_core.h           |  41 +++++-
 lib/librte_ethdev/rte_ethdev_version.map      |   4 +
 20 files changed, 480 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

-- 
2.17.1


* [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-28 11:39   ` Ananyev, Konstantin
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Konstantin Ananyev, david.hunt, liang.j.ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management intrinsics provide architecture-specific
functions to either wait until a specified TSC timestamp is reached, or
wait until either a TSC timestamp is reached or a memory location is
written to. The monitor function also provides an optional comparison,
to avoid sleeping when the expected write has already happened and no
more writes are expected.
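
The two portable pieces of that logic, splitting the 64-bit TSC deadline
into the 32-bit halves passed in EDX:EAX and the optional masked
comparison, can be sketched without any inline assembly. The helper
names `tsc_split()` and `monitor_should_abort()` are hypothetical and
exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Split a 64-bit TSC deadline into the low (EAX) and high (EDX) halves. */
static inline void
tsc_split(uint64_t tsc, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)tsc;
	*hi = (uint32_t)(tsc >> 32);
}

/*
 * Return true when sleeping should be skipped: the mask is non-zero and
 * the masked current value already equals the expected value.
 */
static inline bool
monitor_should_abort(const volatile uint64_t *p, uint64_t expected,
		uint64_t mask)
{
	if (mask == 0)
		return false;
	return (*p & mask) == expected;
}
```

In the actual intrinsic the comparison happens after the address has
been armed, so a write that lands between arming and sleeping is still
noticed.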

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 203 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..8646c4ac16
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index bc73ec2c5c..b54a2be4f6 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -59,6 +59,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..94d6a43763 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
 	RTE_CPUFLAG_EM64T,                  /**< EM64T */
 
+	RTE_CPUFLAG_WAITPKG,                 /**< UMONITOR/UMWAIT/TPAUSE */
 	/* (EAX 80000007h) EDX features */
 	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
 
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a0522400fb
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+	rte_mb();
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


* [dpdk-dev] [RFC 2/6] ethdev: add simple power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-28 12:15   ` Ananyev, Konstantin
  2020-05-27 17:02 ` [dpdk-dev] [RFC 3/6] net/ixgbe: implement " Anatoly Burakov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev
  Cc: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella,
	Neil Horman, david.hunt, liang.j.ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either
a TSC timestamp expires or packets arrive.

This API is limited to 1 core 1 queue use case as there is no
coordination between queues/cores in ethdev.

The TSC timestamp is automatically calculated using current link
speed and RX descriptor ring size, such that the sleep time is
not longer than it would take for a NIC to fill its entire RX
descriptor ring.
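
One way such a bound could be computed is sketched below. The formula is
an assumption made for illustration, not taken from the patch: it treats
back-to-back minimum-size frames (84 bytes on the wire each) as the
worst case, and `ring_fill_tsc_cycles()` is a hypothetical helper.

```c
#include <stdint.h>

/*
 * Assumption: worst case is back-to-back minimum-size frames, each taking
 * 84 bytes on the wire (64-byte frame + 8-byte preamble + 12-byte gap).
 */
#define MIN_FRAME_WIRE_BITS (84u * 8u)

/*
 * Number of TSC cycles it takes a link running at link_mbps to fill
 * nb_desc descriptors with minimum-size frames. Returns 0 if the link
 * speed is unknown (0).
 */
static uint64_t
ring_fill_tsc_cycles(uint32_t link_mbps, uint16_t nb_desc, uint64_t tsc_hz)
{
	if (link_mbps == 0)
		return 0;
	const uint64_t bits = (uint64_t)nb_desc * MIN_FRAME_WIRE_BITS;
	const uint64_t tsc_mhz = tsc_hz / 1000000u;
	/* seconds = bits / (Mbps * 1e6); cycles = seconds * tsc_hz */
	return bits * tsc_mhz / link_mbps;
}
```

The absolute deadline would then be the current TSC value plus this
cycle count; in DPDK the inputs could come from rte_rdtsc(),
rte_get_tsc_hz() and the link speed reported by the port.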

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.c           | 39 +++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 70 ++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h      | 41 +++++++++++++-
 lib/librte_ethdev/rte_ethdev_version.map |  4 ++
 4 files changed, 152 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 8e10a6fc36..0be5ecfc11 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -16,6 +16,7 @@
 #include <netinet/in.h>
 
 #include <rte_byteorder.h>
+#include <rte_cpuflags.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_interrupts.h>
@@ -5053,6 +5054,44 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
 	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
 }
 
+int
+rte_eth_dev_power_mgmt_enable(uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+		return -ENOTSUP;
+
+	/* allocate zeroed memory so empty poll counters start at 0 */
+	dev->empty_poll_stats = rte_zmalloc_socket(NULL,
+		sizeof(struct rte_eth_ep_stat) * RTE_MAX_QUEUES_PER_PORT,
+		0, dev->data->numa_node);
+
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_eth_dev_power_mgmt_disable(uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* disable first so RX stops using the stats; rte_free(NULL) is a no-op */
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	rte_free(dev->empty_poll_stats);
+	dev->empty_poll_stats = NULL;
+	return 0;
+}
+
 /**
  * A set of values to describe the possible states of a switch domain.
  */
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index a49242bcd2..b8318f7e91 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -666,6 +667,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1490,6 +1492,16 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
@@ -4302,6 +4314,38 @@ __rte_experimental
 int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
 				       struct rte_eth_hairpin_cap *cap);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_enable(uint16_t port_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_disable(uint16_t port_id);
+
 #include <rte_ethdev_core.h>
 
 /**
@@ -4417,6 +4461,32 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 		} while (cb != NULL);
 	}
 #endif
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[queue_id].num++;
+			if (unlikely(dev->empty_poll_stats[queue_id].num >
+					ETH_EMPTYPOLL_MAX)) {
+				volatile void *target_addr;
+				uint64_t expected, mask;
+				int ret;
+
+				/*
+				 * get address of next descriptor in the RX
+				 * ring for this queue, as well as expected
+				 * value and a mask.
+				 */
+				ret = (*dev->dev_ops->next_rx_desc)
+					(dev->data->rx_queues[queue_id],
+					 &target_addr, &expected, &mask);
+				if (ret == 0)
+					/* -1ULL is maximum value for TSC */
+					rte_power_monitor(target_addr,
+							  expected, mask,
+							  0, -1ULL);
+			}
+		} else
+			dev->empty_poll_stats[queue_id].num = 0;
+	}
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
 	return nb_rx;
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..4e23d465f0 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Output: address of the next RX descriptor's status field to monitor.
+ *
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +773,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +838,14 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/** Flag indicating the port power management state */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	uint32_t reserved_32;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/** Empty poll counters, one entry per queue */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	void *reserved_ptrs[3];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 7155056045..141361823d 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -241,4 +241,8 @@ EXPERIMENTAL {
 	__rte_ethdev_trace_rx_burst;
 	__rte_ethdev_trace_tx_burst;
 	rte_flow_get_aged_flows;
+
+	# added in 20.08
+	rte_eth_dev_power_mgmt_disable;
+	rte_eth_dev_power_mgmt_enable;
 };
-- 
2.17.1


* [dpdk-dev] [RFC 3/6] net/ixgbe: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
@ 2020-05-27 17:02 ` " Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 4/6] net/i40e: " Anatoly Burakov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Wei Zhao, Jeff Guo, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.
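
The shape of such a callback can be shown with a mock queue. Everything
prefixed `mock_` or `MOCK_` below is hypothetical and stands in for the
driver's real descriptor layout and DD bit; values are kept in host byte
order for simplicity.

```c
#include <stdint.h>

/* Hypothetical DD ("descriptor done") bit, for illustration only. */
#define MOCK_RXD_STAT_DD 0x1u

/* Hypothetical descriptor layout. */
struct mock_rx_desc {
	uint64_t pkt_addr;
	uint64_t status_error;
};

/* Hypothetical RX queue: a ring plus the index of the next descriptor. */
struct mock_rx_queue {
	volatile struct mock_rx_desc *rx_ring;
	uint16_t rx_tail;
};

/* Same signature as the eth_next_rx_desc_t callback from patch 2/6. */
static int
mock_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask)
{
	struct mock_rx_queue *rxq = rx_queue;
	volatile struct mock_rx_desc *rxdp = &rxq->rx_ring[rxq->rx_tail];

	/* watch the status word of the descriptor the NIC writes next */
	*tail_desc_addr = &rxdp->status_error;
	/* DD already set means a packet is waiting, so do not sleep */
	*expected = MOCK_RXD_STAT_DD;
	*mask = MOCK_RXD_STAT_DD;
	return 0;
}
```

The caller passes the returned address, expected value and mask straight
to the monitor intrinsic.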

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index a4e5c539de..190d11d98d 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -605,6 +605,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 2e20e18c7a..ef2fb5fca9 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 20a8b291d4..6c35966c78 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


* [dpdk-dev] [RFC 4/6] net/i40e: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (2 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 3/6] net/ixgbe: implement " Anatoly Burakov
@ 2020-05-27 17:02 ` " Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 5/6] net/ice: " Anatoly Burakov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Beilei Xing, Jeff Guo, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 749d85f544..f3ce54911b 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -526,6 +526,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5e7c86ed82..76dfbb2098 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 8f11f011a7..72d810475b 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -245,6 +245,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


* [dpdk-dev] [RFC 5/6] net/ice: implement power management API
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (3 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 4/6] net/i40e: " Anatoly Burakov
@ 2020-05-27 17:02 ` " Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port Anatoly Burakov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Qiming Yang, Wenzhuo Lu, david.hunt, liang.j.ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index d5110c4392..db8269a548 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -219,6 +219,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 1c9f31efdf..80fd6bd134 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (4 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 5/6] net/ice: " Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  7 siblings, 0 replies; 56+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Wenzhuo Lu, Beilei Xing, Bernard Iremonger, david.hunt, liang.j.ma

A quick-and-dirty testpmd command to enable power management on
a specific port.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 app/test-pmd/cmdline.c | 48 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 996a498768..e3a5e19485 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -1773,6 +1773,53 @@ cmdline_parse_inst_t cmd_config_speed_specific = {
 	},
 };
 
+/* *** enable power management for specific port *** */
+struct cmd_port_pmgmt {
+	cmdline_fixed_string_t port;
+	portid_t id;
+	cmdline_fixed_string_t pmgmt;
+	cmdline_fixed_string_t on;
+};
+
+static void
+cmd_port_pmgmt_parsed(void *parsed_result,
+				__rte_unused struct cmdline *cl,
+				__rte_unused void *data)
+{
+	struct cmd_port_pmgmt *res = parsed_result;
+
+	if (port_id_is_invalid(res->id, ENABLED_WARN))
+		return;
+
+	if (!strcmp(res->on, "on"))
+		rte_eth_dev_power_mgmt_enable(res->id);
+	else if (!strcmp(res->on, "off"))
+		rte_eth_dev_power_mgmt_disable(res->id);
+}
+
+
+cmdline_parse_token_string_t cmd_port_pmgmt_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, port, "port");
+cmdline_parse_token_num_t cmd_port_pmgmt_id =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_pmgmt, id, UINT16);
+cmdline_parse_token_string_t cmd_port_pmgmt_item1 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, pmgmt, "power-mgmt");
+cmdline_parse_token_string_t cmd_port_pmgmt_value1 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_pmgmt, on, "on#off");
+
+cmdline_parse_inst_t cmd_port_pmgmt = {
+	.f = cmd_port_pmgmt_parsed,
+	.data = NULL,
+	.help_str = "port <port_id> power-mgmt on|off",
+	.tokens = {
+		(void *)&cmd_port_pmgmt_port,
+		(void *)&cmd_port_pmgmt_id,
+		(void *)&cmd_port_pmgmt_item1,
+		(void *)&cmd_port_pmgmt_value1,
+		NULL,
+	},
+};
+
 /* *** configure loopback for all ports *** */
 struct cmd_config_loopback_all {
 	cmdline_fixed_string_t port;
@@ -19692,6 +19739,7 @@ cmdline_parse_ctx_t main_ctx[] = {
 	(cmdline_parse_inst_t *)&cmd_show_set_raw,
 	(cmdline_parse_inst_t *)&cmd_show_set_raw_all,
 	(cmdline_parse_inst_t *)&cmd_config_tx_dynf_specific,
+	(cmdline_parse_inst_t *)&cmd_port_pmgmt,
 	NULL,
 };
 
-- 
2.17.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (5 preceding siblings ...)
  2020-05-27 17:02 ` [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port Anatoly Burakov
@ 2020-05-27 17:33 ` Jerin Jacob
  2020-05-27 20:57   ` Stephen Hemminger
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  7 siblings, 1 reply; 56+ messages in thread
From: Jerin Jacob @ 2020-05-27 17:33 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dpdk-dev, David Hunt, Liang Ma

On Wed, May 27, 2020 at 10:32 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers
> to cause the CPU to enter a power-optimized state while
> waiting for packets to arrive, along with a set of
> (hopefully generic) intrinsics that facilitate that. This
> is achieved through cooperation with the NIC driver that
> will allow us to know address of the next NIC RX ring
> packet descriptor, and wait for writes on it.
>
> On IA, this is achieved through using UMONITOR/UMWAIT
> instructions. They are used in their raw opcode form
> because there is no widespread compiler support for
> them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen
> to implement similar instructions.
>
> To achieve power savings, there is a very simple mechanism
> used: we're counting empty polls, and if a certain threshold
> is reached, we get the address of next RX ring descriptor
> from the NIC driver, arm the monitoring hardware, and
> enter a power-optimized state. We will then wake up when
> either a timeout happens, or a write happens (or generally
> whenever CPU feels like waking up - this is platform-
> specific), and proceed as normal. The empty poll counter is
> reset whenever we actually get packets, so we only go to
> sleep when we know nothing is going on.
>
> Why are we putting it into ethdev as opposed to leaving
> this up to the application? Our customers specifically
> requested a way to do it with minimal changes to the
> application code. The current approach allows to just
> flip a switch and automagically have power savings.
>
> There are certain limitations in this patchset right now:
> - Currently, only 1:1 core to queue mapping is supported,
>   meaning that each lcore must at most handle RX on a
>   single queue
> - Currently, power management is enabled per-port, not
>   per-queue
> - There is potential to greatly increase TX latency if we
>   are buffering things, and go to sleep before sending
>   packets
> - The API is not perfect and could use some improvement
>   and discussion
> - The API doesn't extend to other device types
> - The intrinsics are platform-specific, so ethdev has
>   some platform-specific code in it
> - Support was only implemented for devices using
>   net/ixgbe, net/i40e and net/ice drivers
>
> Hopefully this would generate enough feedback to clear
> a path forward!

Just for my understanding:

How is this solution superior to the Rx queue interrupt based scheme that
is used in l3fwd-power?

What I mean by superior here, as an example:
a) Is there any power saving in milliwatts vs the interrupt scheme?
b) Is there any reduction in the time taken to switch from/to a
different state
(i.e. how fast it can move from low power state to full power state) vs
the interrupt scheme?
etc.

Or is this just about pushing all the logic to ethdev so that
applications can be transparent?


>
> Anatoly Burakov (6):
>   eal: add power management intrinsics
>   ethdev: add simple power management API
>   net/ixgbe: implement power management API
>   net/i40e: implement power management API
>   net/ice: implement power management API
>   app/testpmd: add command for power management on a port
>
>  app/test-pmd/cmdline.c                        |  48 +++++++
>  drivers/net/i40e/i40e_ethdev.c                |   1 +
>  drivers/net/i40e/i40e_rxtx.c                  |  23 +++
>  drivers/net/i40e/i40e_rxtx.h                  |   2 +
>  drivers/net/ice/ice_ethdev.c                  |   1 +
>  drivers/net/ice/ice_rxtx.c                    |  23 +++
>  drivers/net/ice/ice_rxtx.h                    |   2 +
>  drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
>  drivers/net/ixgbe/ixgbe_rxtx.c                |  22 +++
>  drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
>  .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
>  lib/librte_eal/x86/rte_cpuflags.c             |   2 +
>  lib/librte_ethdev/rte_ethdev.c                |  39 +++++
>  lib/librte_ethdev/rte_ethdev.h                |  70 +++++++++
>  lib/librte_ethdev/rte_ethdev_core.h           |  41 +++++-
>  lib/librte_ethdev/rte_ethdev_version.map      |   4 +
>  20 files changed, 480 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
>
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
@ 2020-05-27 20:57   ` Stephen Hemminger
  0 siblings, 0 replies; 56+ messages in thread
From: Stephen Hemminger @ 2020-05-27 20:57 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Anatoly Burakov, dpdk-dev, David Hunt, Liang Ma

On Wed, 27 May 2020 23:03:59 +0530
Jerin Jacob <jerinjacobk@gmail.com> wrote:

> On Wed, May 27, 2020 at 10:32 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
> >
> > This patchset proposes a simple API for Ethernet drivers
> > to cause the CPU to enter a power-optimized state while
> > waiting for packets to arrive, along with a set of
> > (hopefully generic) intrinsics that facilitate that. This
> > is achieved through cooperation with the NIC driver that
> > will allow us to know address of the next NIC RX ring
> > packet descriptor, and wait for writes on it.
> >
> > On IA, this is achieved through using UMONITOR/UMWAIT
> > instructions. They are used in their raw opcode form
> > because there is no widespread compiler support for
> > them yet. Still, the API is made generic enough to
> > hopefully support other architectures, if they happen
> > to implement similar instructions.
> >
> > To achieve power savings, there is a very simple mechanism
> > used: we're counting empty polls, and if a certain threshold
> > is reached, we get the address of next RX ring descriptor
> > from the NIC driver, arm the monitoring hardware, and
> > enter a power-optimized state. We will then wake up when
> > either a timeout happens, or a write happens (or generally
> > whenever CPU feels like waking up - this is platform-
> > specific), and proceed as normal. The empty poll counter is
> > reset whenever we actually get packets, so we only go to
> > sleep when we know nothing is going on.
> >
> > Why are we putting it into ethdev as opposed to leaving
> > this up to the application? Our customers specifically
> > requested a way to do it with minimal changes to the
> > application code. The current approach allows to just
> > flip a switch and automagically have power savings.
> >
> > There are certain limitations in this patchset right now:
> > - Currently, only 1:1 core to queue mapping is supported,
> >   meaning that each lcore must at most handle RX on a
> >   single queue
> > - Currently, power management is enabled per-port, not
> >   per-queue
> > - There is potential to greatly increase TX latency if we
> >   are buffering things, and go to sleep before sending
> >   packets
> > - The API is not perfect and could use some improvement
> >   and discussion
> > - The API doesn't extend to other device types
> > - The intrinsics are platform-specific, so ethdev has
> >   some platform-specific code in it
> > - Support was only implemented for devices using
> >   net/ixgbe, net/i40e and net/ice drivers
> >
> > Hopefully this would generate enough feedback to clear
> > a path forward!  
> 
> Just for my understanding:
> 
> How is this solution superior to the Rx queue interrupt based scheme that
> is used in l3fwd-power?
> 
> What I mean by superior here, as an example:
> a) Is there any power saving in milliwatts vs the interrupt scheme?
> b) Is there any reduction in the time taken to switch from/to a
> different state
> (i.e. how fast it can move from low power state to full power state) vs
> the interrupt scheme?
> etc.
> 
> Or is this just about pushing all the logic to ethdev so that
> applications can be transparent?
> 

The interrupt scheme is going to get better power management since
the core can go to WAIT. This scheme does look interesting in theory
since it will be lower latency.

But it has a number of issues:
  * changing drivers
  * can not multiplex multiple queues per core; you are assuming
    a certain threading model
  * what if thread is preempted
  * what about thread in a VM
  * platform specific: ARM and x86 have different semantics here


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
@ 2020-05-28 11:39   ` Ananyev, Konstantin
  2020-05-28 14:40     ` Burakov, Anatoly
  0 siblings, 1 reply; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 11:39 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

Hi Anatoly,

> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.

The ARM folks recently introduced a new generic API
for (as I understand it) similar purposes: rte_wait_until_equal_(16|32|64).
It would probably make sense to unify both APIs into something common
and HW-transparent.
Konstantin
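
For illustration only, here is a minimal portable sketch of the "masked compare, then wait" semantics that a common, HW-transparent API would have to provide on platforms with no UMWAIT-like instruction. The helper name `wait_masked_equal` and the `max_spins` budget are assumptions made for this sketch; they are not part of DPDK or of this patchset, and a real implementation would wait on a TSC deadline rather than a spin count.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a DPDK API): poll *addr until the masked value
 * equals `expected`, or until `max_spins` iterations have elapsed.
 * Returns true if the expected value was observed, false on timeout.
 */
static inline bool
wait_masked_equal(const volatile uint64_t *addr, uint64_t expected,
		uint64_t mask, uint64_t max_spins)
{
	uint64_t i;

	for (i = 0; i < max_spins; i++) {
		if ((*addr & mask) == expected)
			return true;
		/*
		 * An architecture-specific hook (e.g. UMONITOR/UMWAIT on x86
		 * or WFE on ARM) would replace this busy loop.
		 */
	}
	/* one final check so that a zero spin budget still reads the value */
	return (*addr & mask) == expected;
}
```

The point of the sketch is only that address, expected value and mask are enough to describe the wake-up condition independently of the instruction used to wait.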

> 
> Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
>  lib/librte_eal/x86/rte_cpuflags.c             |   2 +
>  6 files changed, 203 insertions(+)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> 
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..8646c4ac16
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file defines APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index bc73ec2c5c..b54a2be4f6 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -59,6 +59,7 @@ generic_headers = files(
>  	'generic/rte_memcpy.h',
>  	'generic/rte_pause.h',
>  	'generic/rte_prefetch.h',
> +	'generic/rte_power_intrinsics.h',
>  	'generic/rte_rwlock.h',
>  	'generic/rte_spinlock.h',
>  	'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_prefetch.h',
> +	'rte_power_intrinsics.h',
>  	'rte_pause.h',
>  	'rte_rtm.h',
>  	'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..94d6a43763 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
>  	RTE_CPUFLAG_EM64T,                  /**< EM64T */
> 
> +	RTE_CPUFLAG_WAITPKG,                 /**< UMONITOR/UMWAIT/TPAUSE */
>  	/* (EAX 80000007h) EDX features */
>  	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
> 
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..a0522400fb
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	uint64_t rflags;
> +
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +	rte_mb();
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	uint64_t rflags;
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 30439e7951..0325c4b93b 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
>  	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>  	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
> 
> +	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
> +
>  	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>  	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
> 
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [RFC 2/6] ethdev: add simple power management API
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
@ 2020-05-28 12:15   ` Ananyev, Konstantin
  0 siblings, 0 replies; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 12:15 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko, Ray Kinsella,
	Neil Horman, Hunt, David, Ma, Liang J

> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 queue use case as there is no
> coordination between queues/cores in ethdev.
> 
> The TSC timestamp is automatically calculated using current link
> speed and RX descriptor ring size, such that the sleep time is
> not longer than it would take for a NIC to fill its entire RX
> descriptor ring.
> 
> Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_ethdev/rte_ethdev.c           | 39 +++++++++++++
>  lib/librte_ethdev/rte_ethdev.h           | 70 ++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_core.h      | 41 +++++++++++++-
>  lib/librte_ethdev/rte_ethdev_version.map |  4 ++
>  4 files changed, 152 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 8e10a6fc36..0be5ecfc11 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -16,6 +16,7 @@
>  #include <netinet/in.h>
> 
>  #include <rte_byteorder.h>
> +#include <rte_cpuflags.h>
>  #include <rte_log.h>
>  #include <rte_debug.h>
>  #include <rte_interrupts.h>
> @@ -5053,6 +5054,44 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
>  	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
>  }
> 
> +int
> +rte_eth_dev_power_mgmt_enable(uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> +		return -ENOTSUP;
> +
> +	/* allocate memory for empty poll stats */
> +	dev->empty_poll_stats = rte_zmalloc_socket(NULL,
> +		sizeof(struct rte_eth_ep_stat) * RTE_MAX_QUEUES_PER_PORT,
> +		0, dev->data->numa_node);
> +
> +	if (dev->empty_poll_stats == NULL)
> +		return -ENOMEM;
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> +	return 0;
> +}
> +
> +int
> +rte_eth_dev_power_mgmt_disable(uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/* rte_free ignores NULL so safe to call without checks */
> +	rte_free(dev->empty_poll_stats);
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> +	return 0;
> +}
> +
>  /**
>   * A set of values to describe the possible states of a switch domain.
>   */
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index a49242bcd2..b8318f7e91 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -157,6 +157,7 @@ extern "C" {
>  #include <rte_common.h>
>  #include <rte_config.h>
>  #include <rte_ether.h>
> +#include <rte_power_intrinsics.h>
> 
>  #include "rte_ethdev_trace_fp.h"
>  #include "rte_dev_info.h"
> @@ -666,6 +667,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
>  /** Maximum nb. of vlan per mirror rule */
>  #define ETH_MIRROR_MAX_VLANS       64
> 
> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
>  #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
>  #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
>  #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
> @@ -1490,6 +1492,16 @@ enum rte_eth_dev_state {
>  	RTE_ETH_DEV_REMOVED,
>  };
> 
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum rte_eth_dev_power_mgmt_state {
> +	/** Device power management is disabled. */
> +	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	RTE_ETH_DEV_POWER_MGMT_ENABLED
> +};
> +
>  struct rte_eth_dev_sriov {
>  	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
>  	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
> @@ -4302,6 +4314,38 @@ __rte_experimental
>  int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
>  				       struct rte_eth_hairpin_cap *cap);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable device power management.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_eth_dev_power_mgmt_enable(uint16_t port_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Disable device power management.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_eth_dev_power_mgmt_disable(uint16_t port_id);
> +
>  #include <rte_ethdev_core.h>
> 
>  /**
> @@ -4417,6 +4461,32 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>  		} while (cb != NULL);
>  	}
>  #endif
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[queue_id].num++;
> +			if (unlikely(dev->empty_poll_stats[queue_id].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +				volatile void *target_addr;
> +				uint64_t expected, mask;
> +				int ret;
> +
> +				/*
> +				 * get address of next descriptor in the RX
> +				 * ring for this queue, as well as expected
> +				 * value and a mask.
> +				 */
> +				ret = (*dev->dev_ops->next_rx_desc)
> +					(dev->data->rx_queues[queue_id],
> +					 &target_addr, &expected, &mask);

That will make every PMD that doesn't implement the next_rx_desc op crash.
One simple way to avoid it is to check in rte_eth_dev_power_mgmt_enable() that the PMD
does implement ops->next_rx_desc.
However, I don't think introducing such a new op is the best approach, as it implies
that the PMD has its HW RX descriptor mapped into WB-type memory, and dictates
to the PMD what it should sleep on.
Depending on HW/SW capabilities and implementation, a PMD might choose to
sleep on a different thing (HW doorbell, SW cond var, etc.).
Another thing: I doubt it is a good idea to pollute the generic RX function with
power-specific code (again, as I said above, it probably wouldn't be that generic for all possible PMDs).
From my perspective, we have 2 alternatives to implement such functionality:
1. Keep rte_eth_dev_power_mgmt_enable/disable(port, queue) and move the actual
    *wait_on* code into the PMD RX implementations (we can probably still have some common
    logic about the allowed number of empty polls, max timeout to sleep, etc.).
2. Drop rte_eth_dev_power_mgmt_enable/disable and introduce an explicit
    rte_eth_dev_wait_for_packet(port, queue, timeout) API function.

In both cases the PMD will have full freedom to implement the *wait_on_packet* functionality
in the most convenient way.
For 2) the user would have to do some extra work himself
(count the number of consecutive empty polls, call the *wait_on_packet* function explicitly).
Though I think it can easily be hidden inside some wrapper API on top
of rte_eth_rx_burst()/rte_eth_dev_wait_for_packet(),
something like rte_eth_rx_burst_wait() or so.
We can have the logic about the allowed number of empty polls,
and maybe some other conditions, in that top-level function.
In that case changes in the user app will still be minimal.
On the other hand, 2) gives the user explicit control over where and when to sleep,
so from my perspective it seems more straightforward and flexible.
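
A minimal sketch of what such a wrapper for alternative 2) might look like.
Everything here is hypothetical: rx_burst_wait(), struct rx_wait_ctx and the
fake_* helpers are illustration-only names, and the two function pointers
merely stand in for rte_eth_rx_burst() and the proposed
rte_eth_dev_wait_for_packet() so that the example compiles without DPDK.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the RX burst and wait-for-packet calls. */
typedef uint16_t (*rx_burst_fn)(uint16_t port, uint16_t queue,
				void **pkts, uint16_t n);
typedef int (*wait_fn)(uint16_t port, uint16_t queue, uint64_t timeout);

struct rx_wait_ctx {
	rx_burst_fn rx_burst;      /* would be rte_eth_rx_burst() */
	wait_fn wait_for_packet;   /* would be the proposed wait call */
	uint32_t empty_polls;      /* consecutive empty polls seen so far */
	uint32_t max_empty_polls;  /* threshold before we try to sleep */
	uint64_t timeout;          /* passed through to the wait call */
};

/*
 * Poll once. After more than max_empty_polls consecutive empty polls,
 * ask the driver to wait for a packet. The counter is only reset when
 * packets arrive, so an idle queue keeps sleeping on every empty poll.
 */
static inline uint16_t
rx_burst_wait(struct rx_wait_ctx *ctx, uint16_t port, uint16_t queue,
	      void **pkts, uint16_t n)
{
	uint16_t nb_rx = ctx->rx_burst(port, queue, pkts, n);

	if (nb_rx != 0) {
		ctx->empty_polls = 0;
		return nb_rx;
	}
	if (++ctx->empty_polls > ctx->max_empty_polls)
		ctx->wait_for_packet(port, queue, ctx->timeout);
	return 0;
}

/* Fake driver used only to exercise the sketch. */
static int fake_wait_calls;
static uint16_t fake_rx_ready; /* packets the fake queue will hand out */

static uint16_t
fake_rx_burst(uint16_t port, uint16_t queue, void **pkts, uint16_t n)
{
	uint16_t got = fake_rx_ready < n ? fake_rx_ready : n;

	(void)port; (void)queue; (void)pkts;
	fake_rx_ready = (uint16_t)(fake_rx_ready - got);
	return got;
}

static int
fake_wait(uint16_t port, uint16_t queue, uint64_t timeout)
{
	(void)port; (void)queue; (void)timeout;
	fake_wait_calls++;
	return 0;
}
```

With this shape the application only swaps its rx_burst call for the wrapper,
while the decision of what to sleep on stays entirely inside the driver's
wait function.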

> +				if (ret == 0)
> +					/* -1ULL is maximum value for TSC */
> +					rte_power_monitor(target_addr,
> +							  expected, mask,
> +							  0, -1ULL);
> +			}
> +		} else
> +			dev->empty_poll_stats[queue_id].num = 0;
> +	}
> 
>  	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
>  	return nb_rx;
> diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
> index 32407dd418..4e23d465f0 100644
> --- a/lib/librte_ethdev/rte_ethdev_core.h
> +++ b/lib/librte_ethdev/rte_ethdev_core.h
> @@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the next RX ring descriptor address.
> + *
> + * @param rxq
> + *   ethdev queue pointer.
> + * @param tail_desc_addr
> + *   the pointer point to descriptor address var.
> + *
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */
> +typedef int (*eth_next_rx_desc_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -752,6 +773,8 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_next_rx_desc_t next_rx_desc;
> +	/**< Get next RX ring descriptor address. */
>  };
> 
>  /**
> @@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
>  	void *param;
>  };
> 
> +/**
> + * @internal
> + * Structure used to hold counters for empty poll
> + */
> +struct rte_eth_ep_stat {
> +	uint64_t num;
> +} __rte_cache_aligned;
> +
>  /**
>   * @internal
>   * The generic data structure associated with each ethernet device.
> @@ -807,8 +838,14 @@ struct rte_eth_dev {
>  	enum rte_eth_dev_state state; /**< Flag indicating the port state */
>  	void *security_ctx; /**< Context for security ops */
> 
> -	uint64_t reserved_64s[4]; /**< Reserved for future fields */
> -	void *reserved_ptrs[4];   /**< Reserved for future fields */
> +	/**< Empty poll number */
> +	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
> +	uint32_t reserved_32;
> +	uint64_t reserved_64s[3]; /**< Reserved for future fields */
> +
> +	/**< Flag indicating the port power state */
> +	struct rte_eth_ep_stat *empty_poll_stats;
> +	void *reserved_ptrs[3];   /**< Reserved for future fields */
>  } __rte_cache_aligned;
> 
>  struct rte_eth_dev_sriov;
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index 7155056045..141361823d 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -241,4 +241,8 @@ EXPERIMENTAL {
>  	__rte_ethdev_trace_rx_burst;
>  	__rte_ethdev_trace_tx_burst;
>  	rte_flow_get_aged_flows;
> +
> +	# added in 20.08
> +	rte_eth_dev_power_mgmt_disable;
> +	rte_eth_dev_power_mgmt_enable;
>  };
> --
> 2.17.1


* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 11:39   ` Ananyev, Konstantin
@ 2020-05-28 14:40     ` Burakov, Anatoly
  2020-05-28 14:58       ` Bruce Richardson
  2020-05-28 15:38       ` Ananyev, Konstantin
  0 siblings, 2 replies; 56+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 14:40 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> Hi Anatoly,
> 
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
> 
> Recently ARM guys introduced new generic API
> for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> Probably would make sense to unite both APIs into something common
> and HW transparent.
> Konstantin

Hi Konstantin,

That's not really a similar purpose. This is monitoring a cacheline for
writes, not waiting on a specific value. The "expected" value is there
basically as a hack to get around a race condition: by the time you
enter the monitoring state, the write you're waiting for may have
already happened.
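
To illustrate the ordering, here is a minimal sketch assuming a two-step
"arm, then wait" primitive. The names monitor_sketch(), struct monitor_hooks
and the fake_* helpers are hypothetical; the hooks only stand in for
UMONITOR and UMWAIT and are not the code from the patch.

```c
#include <stdint.h>

/* Hypothetical hooks standing in for the arm (UMONITOR-like) and
 * wait (UMWAIT-like) steps. */
struct monitor_hooks {
	void (*arm)(const volatile void *addr);
	void (*wait)(void);
};

/*
 * Returns 1 if we went to sleep, 0 if sleeping was skipped because the
 * awaited write had already landed. The check happens after arming, so
 * a write that arrives between the check and the wait still wakes us.
 */
static inline int
monitor_sketch(const struct monitor_hooks *h, const volatile uint64_t *p,
	       uint64_t expected, uint64_t mask)
{
	h->arm(p);

	/* mask == 0 disables the check, as in the proposed API. */
	if (mask != 0 && (*p & mask) == expected)
		return 0;

	h->wait();
	return 1;
}

/* Fake hooks used only to exercise the sketch. */
static int fake_arms, fake_waits;

static void
fake_arm(const volatile void *addr)
{
	(void)addr;
	fake_arms++;
}

static void
fake_wait_hook(void)
{
	fake_waits++;
}
```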

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 14:40     ` Burakov, Anatoly
@ 2020-05-28 14:58       ` Bruce Richardson
  2020-05-28 15:38       ` Ananyev, Konstantin
  1 sibling, 0 replies; 56+ messages in thread
From: Bruce Richardson @ 2020-05-28 14:58 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Ananyev, Konstantin, dev, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On Thu, May 28, 2020 at 03:40:18PM +0100, Burakov, Anatoly wrote:
> On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> > Hi Anatoly,
> > 
> > > 
> > > Add two new power management intrinsics, and provide an implementation
> > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > are implemented as raw byte opcodes because there is not yet widespread
> > > compiler support for these instructions.
> > > 
> > > The power management instructions provide an architecture-specific
> > > function to either wait until a specified TSC timestamp is reached, or
> > > optionally wait until either a TSC timestamp is reached or a memory
> > > location is written to. The monitor function also provides an optional
> > > comparison, to avoid sleeping when the expected write has already
> > > happened, and no more writes are expected.
> > 
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
> 
> Hi Konstantin,
> 
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. The "expected" value is there as
> basically a hack to get around the race condition due to the fact that by
> the time you enter monitoring state, the write you're waiting for may have
> already happened.
> 
Rather than the "expected" value, is it not more useful for a general API
to check for the existing value? Since we are awaiting writes, we may not
know what new value will be written, but we do know what the value is now,
and can just check that it's not been changed.
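
As a rough sketch of that alternative check (value_unchanged() is a
hypothetical helper, not something from the patch): the caller passes the
value it last observed, and sleeping only makes sense while the masked bits
still match it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True while the masked bits at *p still equal the previously observed
 * value, i.e. no interesting write has happened yet and it is still
 * worth entering the low-power state.
 */
static inline bool
value_unchanged(const volatile uint64_t *p, uint64_t seen, uint64_t mask)
{
	return ((*p ^ seen) & mask) == 0;
}
```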

Regards,
/Bruce


* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 14:40     ` Burakov, Anatoly
  2020-05-28 14:58       ` Bruce Richardson
@ 2020-05-28 15:38       ` Ananyev, Konstantin
  2020-05-29  6:56         ` Jerin Jacob
  1 sibling, 1 reply; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 15:38 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli


> > Hi Anatoly,
> >
> >>
> >> Add two new power management intrinsics, and provide an implementation
> >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >> are implemented as raw byte opcodes because there is not yet widespread
> >> compiler support for these instructions.
> >>
> >> The power management instructions provide an architecture-specific
> >> function to either wait until a specified TSC timestamp is reached, or
> >> optionally wait until either a TSC timestamp is reached or a memory
> >> location is written to. The monitor function also provides an optional
> >> comparison, to avoid sleeping when the expected write has already
> >> happened, and no more writes are expected.
> >
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
> 
> Hi Konstantin,
> 
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. 

I understand that.

> The "expected" value is there
> as basically a hack to get around the race condition due to the fact
> that by the time you enter monitoring state, the write you're waiting
> for may have already happened.

AFAIK, current rte_wait_until_equal_* does pretty much the same thing:

LDXR memaddr, $reg  // an address to monitor for
if ($reg != expected_value)
   SEVL      //     arm monitor
   do {
       WFE     //      waits for write to that memory address  
       LDXR memaddr, $reg
   } while ($reg != expected_value);   
 
This looks pretty similar to what rte_power_monitor() does,
except that you don't have a loop for checking the new value.
Plus, rte_power_monitor() provides extra options to the user:
a timestamp and the power-save mode to enter.
Also, I don't know what the granularity of such events is on ARM,
whether it is a cache line or more/less.
Maybe ARM people can comment/correct me here.
Konstantin


* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 15:38       ` Ananyev, Konstantin
@ 2020-05-29  6:56         ` Jerin Jacob
  2020-06-02 10:15           ` Ananyev, Konstantin
  2020-06-03  6:22           ` Honnappa Nagarahalli
  0 siblings, 2 replies; 56+ messages in thread
From: Jerin Jacob @ 2020-05-29  6:56 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, Honnappa.Nagarahalli

On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
> > > Hi Anatoly,
> > >
> > >>
> > >> Add two new power management intrinsics, and provide an implementation
> > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > >> are implemented as raw byte opcodes because there is not yet widespread
> > >> compiler support for these instructions.
> > >>
> > >> The power management instructions provide an architecture-specific
> > >> function to either wait until a specified TSC timestamp is reached, or
> > >> optionally wait until either a TSC timestamp is reached or a memory
> > >> location is written to. The monitor function also provides an optional
> > >> comparison, to avoid sleeping when the expected write has already
> > >> happened, and no more writes are expected.
> > >
> > > Recently ARM guys introduced new generic API
> > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > Probably would make sense to unite both APIs into something common
> > > and HW transparent.
> > > Konstantin
> >
> > Hi Konstantin,
> >
> > That's not really similar purpose. This is monitoring a cacheline for
> > writes, not waiting on a specific value.
>
> I understand that.
>
> > The "expected" value is there
> > as basically a hack to get around the race condition due to the fact
> > that by the time you enter monitoring state, the write you're waiting
> > for may have already happened.
>
> AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
>
> LDXR memaddr, $reg  // an address to monitor for
> if ($reg != expected_value)
>    SEVL      //     arm monitor
>    do {
>        WFE     //      waits for write to that memory address
>        LDXR memaddr, $reg
>    } while ($reg != expected_value);
>
> Looks pretty similar to what rte_power_monitor() does,
> except you don't have a loop for checking the new value.
> Plus rte_power_monitor() provides extra options to the user -
> timestamp and power save mode to enter.
> Also I don't know what is the granularity of such events on ARM,
> is it a cache-line or more/less.

As I understand it, the granularity is per cache line,
i.e. Load-exclusive (LDXR) followed by WFE will wait in a low-power
state until the cache line is written.

But I see UMONITOR as a bit different: _without_ another core signaling
it to wake up from the wait state,
it can wake on TSC expiry. I think that is the main primitive of
this feature. Right?

WFE can also wake based on timer stream events (kind of like TSC in the
x86 analogy), but there is a configuration
bit, defined by EL1 (Linux kernel), that controls whether this scheme
is allowed in userspace (EL0).
I am planning to spend time on this after understanding the value
addition of the feature/use case [1].
[1]
http://mails.dpdk.org/archives/dev/2020-May/168888.html





> Might be ARM people can comment/correct me here.
> Konstantin


* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-29  6:56         ` Jerin Jacob
@ 2020-06-02 10:15           ` Ananyev, Konstantin
  2020-06-03  6:22           ` Honnappa Nagarahalli
  1 sibling, 0 replies; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-06-02 10:15 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, Honnappa.Nagarahalli


> 
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an implementation
> > > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > >> are implemented as raw byte opcodes because there is not yet widespread
> > > >> compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an architecture-specific
> > > >> function to either wait until a specified TSC timestamp is reached, or
> > > >> optionally wait until either a TSC timestamp is reached or a memory
> > > >> location is written to. The monitor function also provides an optional
> > > >> comparison, to avoid sleeping when the expected write has already
> > > >> happened, and no more writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API
> > > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline for
> > > writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're waiting
> > > for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg  // an address to monitor for
> > if ($reg != expected_value)
> >    SEVL      //     arm monitor
> >    do {
> >        WFE     //      waits for write to that memory address
> >        LDXR memaddr, $reg
> >    } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does,
> > except you don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM,
> > is it a cache-line or more/less.
> 
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power
> state until the cache line is written.
> 
> But I see UMONITOR bit different, Where _without_ other core signaling
> to wakeup from wait state,
> it can wake on TSC expiry. I think, that's is the main primitive on
> this feature. Right?
> 
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
> I am planning to spend time on this after understanding the value
> addition of the feature/usecase[1]
> [1]
> http://mails.dpdk.org/archives/dev/2020-May/168888.html
> 

OK, if there is a consensus to keep these two APIs disjoint for now,
I won't insist.
Konstantin



* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-29  6:56         ` Jerin Jacob
  2020-06-02 10:15           ` Ananyev, Konstantin
@ 2020-06-03  6:22           ` Honnappa Nagarahalli
  2020-06-03  6:31             ` Jerin Jacob
  1 sibling, 1 reply; 56+ messages in thread
From: Honnappa Nagarahalli @ 2020-06-03  6:22 UTC (permalink / raw)
  To: Jerin Jacob, Ananyev, Konstantin
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma,
	Liang J, nd, nd

<snip>

> 
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an
> > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > >> The instructions are implemented as raw byte opcodes because
> > > >> there is not yet widespread compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an
> > > >> architecture-specific function to either wait until a specified
> > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > >> timestamp is reached or a memory location is written to. The
> > > >> monitor function also provides an optional comparison, to avoid
> > > >> sleeping when the expected write has already happened, and no more
> writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API for similar (as I
> > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline
> > > for writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're
> > > waiting for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg  // an address to monitor for if ($reg !=
> > expected_value)
> >    SEVL      //     arm monitor
> >    do {
> >        WFE     //      waits for write to that memory address
> >        LDXR memaddr, $reg
> >    } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does, except you
> > don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM, is it
> > a cache-line or more/less.
> 
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> the cache line is written.
Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.

> 
> But I see UMONITOR bit different, Where _without_ other core signaling to
> wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> primitive on this feature. Right?
> 
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
Timer stream events are not per CPU core. They are system wide streams.

> I am planning to spend time on this after understanding the value addition of
> the feature/usecase[1] [1] http://mails.dpdk.org/archives/dev/2020-
> May/168888.html
> 
> 
> 
> 
> 
> > Might be ARM people can comment/correct me here.
> > Konstantin


* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-06-03  6:22           ` Honnappa Nagarahalli
@ 2020-06-03  6:31             ` Jerin Jacob
  0 siblings, 0 replies; 56+ messages in thread
From: Jerin Jacob @ 2020-06-03  6:31 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Ananyev, Konstantin, Burakov, Anatoly, dev, Richardson, Bruce,
	Hunt, David, Ma, Liang J, nd

On Wed, Jun 3, 2020 at 11:53 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> >
> > On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > >
> > >
> > > > > Hi Anatoly,
> > > > >
> > > > >>
> > > > >> Add two new power management intrinsics, and provide an
> > > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > > >> The instructions are implemented as raw byte opcodes because
> > > > >> there is not yet widespread compiler support for these instructions.
> > > > >>
> > > > >> The power management instructions provide an
> > > > >> architecture-specific function to either wait until a specified
> > > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > > >> timestamp is reached or a memory location is written to. The
> > > > >> monitor function also provides an optional comparison, to avoid
> > > > >> sleeping when the expected write has already happened, and no more
> > writes are expected.
> > > > >
> > > > > Recently ARM guys introduced new generic API for similar (as I
> > > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > > Probably would make sense to unite both APIs into something common
> > > > > and HW transparent.
> > > > > Konstantin
> > > >
> > > > Hi Konstantin,
> > > >
> > > > That's not really similar purpose. This is monitoring a cacheline
> > > > for writes, not waiting on a specific value.
> > >
> > > I understand that.
> > >
> > > > The "expected" value is there
> > > > as basically a hack to get around the race condition due to the fact
> > > > that by the time you enter monitoring state, the write you're
> > > > waiting for may have already happened.
> > >
> > > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> > >
> > > LDXR memaddr, $reg  // an address to monitor for if ($reg !=
> > > expected_value)
> > >    SEVL      //     arm monitor
> > >    do {
> > >        WFE     //      waits for write to that memory address
> > >        LDXR memaddr, $reg
> > >    } while ($reg != expected_value);
> > >
> > > Looks pretty similar to what rte_power_monitor() does, except you
> > > don't have a loop for checking the new value.
> > > Plus rte_power_monitor() provides extra options to the user -
> > > timestamp and power save mode to enter.
> > > Also I don't know what is the granularity of such events on ARM, is it
> > > a cache-line or more/less.
> >
> > As I understand it, Granularity is per the cache-line.
> > ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> > the cache line is written.
> Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.
>
> >
> > But I see UMONITOR bit different, Where _without_ other core signaling to
> > wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> > primitive on this feature. Right?
> >
> > WFE can also wake based on Timer stream events(kind of TSC in x86
> > analogy) but it has a configuration
> > bit that needs to allow for this scheme in userspace(EL0) or not?
> > defined by EL1(Linux kernel).
> Timer stream events are not per CPU core. They are system wide streams.

We may not need per-core support to implement this use case.

I think the kernel is currently configured to generate a WFE wake-up
event every 100us (system-wide).

The do {} while loop can check after each WFE whether the requested
timestamp has passed, but the minimum granularity will be 100us
(i.e. 100us worth of ticks).
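
A minimal sketch of that loop, with the caveat that everything here is an
assumption for illustration: wait_equal_or_deadline(), struct wfe_hooks and
the fake_* helpers are hypothetical, the hooks stand in for WFE and a
timestamp read, and the fake event stream simply advances time by 100 ticks
per wake-up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks standing in for WFE and a timestamp read. */
struct wfe_hooks {
	void (*wait_event)(void);
	uint64_t (*now)(void);
};

/*
 * Wait until *p equals expected or the deadline passes. The deadline is
 * only checked after each wake-up, so the overshoot can be up to one
 * event-stream period. Returns true if the value was observed.
 */
static inline bool
wait_equal_or_deadline(const struct wfe_hooks *h, const volatile uint64_t *p,
		       uint64_t expected, uint64_t deadline)
{
	do {
		if (*p == expected)
			return true;
		h->wait_event();
	} while (h->now() < deadline);

	return *p == expected;
}

/* Fake event stream: every wake-up advances time by 100 ticks. */
static uint64_t fake_ticks;

static void
fake_wait_event(void)
{
	fake_ticks += 100;
}

static uint64_t
fake_now(void)
{
	return fake_ticks;
}
```

With a deadline of 250 ticks and a value that never changes, this fake setup
returns only at tick 300, which is the coarse granularity described above.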


>
> > I am planning to spend time on this after understanding the value addition of
> > the feature/usecase[1] [1] http://mails.dpdk.org/archives/dev/2020-
> > May/168888.html
> >
> >
> >
> >
> >
> > > Might be ARM people can comment/correct me here.
> > > Konstantin


* [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
                   ` (6 preceding siblings ...)
  2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
@ 2020-08-11 10:27 ` Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
                     ` (6 more replies)
  7 siblings, 7 replies; 56+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 138 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 207 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 000000000..8646c4ac1
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd0902795..3a12e87e1 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2f..494a8142a 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d..94d6a4376 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
 	RTE_CPUFLAG_EM64T,                  /**< EM64T */
 
+	RTE_CPUFLAG_WAITPKG,                 /**< UMONITOR/UMWAIT/TPAUSE */
 	/* (EAX 80000007h) EDX features */
 	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
 
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 000000000..af8aa9459
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+	rte_mb();
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e795..0325c4b93 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-13 18:11     ` Liang, Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API Liang Ma
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 56+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Add a simple on/off switch that enables saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when that number exceeds a certain threshold, entering an
architecture-defined optimized power state that lasts until either
a TSC timestamp expires or packets arrive.
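
The empty-poll counting described above can be sketched roughly as
follows. This is a minimal illustration, not the code from the patch:
`struct empty_poll_state`, `empty_poll_update()` and
`EMPTY_POLL_THRESHOLD` are hypothetical names, and the threshold of 512
is only assumed to match the `ETH_EMPTYPOLL_MAX` value defined in the
diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold, assumed equal to ETH_EMPTYPOLL_MAX in the patch. */
#define EMPTY_POLL_THRESHOLD 512

/* Hypothetical per-queue state; the patch keeps this in rte_eth_ep_stat. */
struct empty_poll_state {
	uint64_t empty_polls;
};

/*
 * Record the result of one RX burst. Returns true when the caller should
 * apply its power-saving action (UMWAIT, PAUSE loop or frequency
 * scale-down) before polling again.
 */
static bool
empty_poll_update(struct empty_poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: reset the counter and keep polling normally */
		st->empty_polls = 0;
		return false;
	}

	st->empty_polls++;
	return st->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

A caller would invoke this once per RX burst with the number of packets
received, and only enter the optimized power state when it returns true.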

This API is limited to the one-core, one-queue use case, as there is
no coordination between queues/cores in ethdev.

This design leverages the RX callback mechanism, which allows three
different power management methods to coexist:

1. UMONITOR/UMWAIT

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. PAUSE instruction

   Instead of moving the core into a deeper C-state, this lightweight
   method uses the PAUSE instruction to relieve the processor from
   busy polling.

3. Frequency scaling

   Reuse the existing rte_power library to scale the core frequency
   up or down depending on traffic volume.
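
As a rough worked example of the timeout described in method 1, the
sketch below estimates how many TSC cycles a NIC needs to fill its RX
ring at line rate with minimum-size frames. It is an assumption-laden
illustration only: `ring_fill_tsc_cycles()` is a hypothetical helper
that does not appear in this series, and the callback in this revision
of the diff passes -1ULL as the timestamp instead.

```c
#include <stdint.h>

/*
 * Hypothetical helper: worst-case number of TSC cycles for the NIC to
 * fill nb_rx_desc descriptors, assuming minimum-size Ethernet frames
 * arriving back to back at line rate.
 */
static uint64_t
ring_fill_tsc_cycles(uint64_t link_speed_mbps, uint16_t nb_rx_desc,
		uint64_t tsc_hz)
{
	/* 64-byte frame + 8-byte preamble + 12-byte inter-frame gap */
	const uint64_t min_wire_bits = (64 + 8 + 12) * 8;
	/* packets per second at line rate with minimum-size frames */
	const uint64_t max_pps =
		link_speed_mbps * 1000000ULL / min_wire_bits;

	if (max_pps == 0)
		return 0; /* link down or unknown speed: no meaningful bound */

	return (uint64_t)nb_rx_desc * tsc_hz / max_pps;
}
```

For a 10 Gbps link, a 512-entry ring and a 2 GHz TSC this comes to
roughly 69,000 cycles, i.e. about 34 microseconds; the value would be
added to the current TSC reading to form the wakeup deadline.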

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 config/common_base                       |   4 +-
 lib/Makefile                             |   1 +
 lib/librte_ethdev/Makefile               |   2 +-
 lib/librte_ethdev/meson.build            |   2 +-
 lib/librte_ethdev/rte_ethdev.c           | 198 +++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           |  59 +++++++
 lib/librte_ethdev/rte_ethdev_core.h      |  43 ++++-
 lib/librte_ethdev/rte_ethdev_version.map |   4 +
 lib/meson.build                          |   5 +-
 mk/rte.app.mk                            |   2 +-
 10 files changed, 311 insertions(+), 9 deletions(-)

diff --git a/config/common_base b/config/common_base
index f76585f16..e0948f0cb 100644
--- a/config/common_base
+++ b/config/common_base
@@ -155,7 +155,7 @@ CONFIG_RTE_MAX_ETHPORTS=32
 CONFIG_RTE_MAX_QUEUES_PER_PORT=1024
 CONFIG_RTE_LIBRTE_IEEE1588=n
 CONFIG_RTE_ETHDEV_QUEUE_STAT_CNTRS=16
-CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=y
+CONFIG_RTE_ETHDEV_RXTX_CALLBACKS=n
 CONFIG_RTE_ETHDEV_PROFILE_WITH_VTUNE=n
 
 #
@@ -978,7 +978,7 @@ CONFIG_RTE_LIBRTE_ACL_DEBUG=n
 #
 # Compile librte_power
 #
-CONFIG_RTE_LIBRTE_POWER=n
+CONFIG_RTE_LIBRTE_POWER=y
 CONFIG_RTE_LIBRTE_POWER_DEBUG=n
 CONFIG_RTE_MAX_LCORE_FREQS=64
 
diff --git a/lib/Makefile b/lib/Makefile
index 8f5b68a2d..87646698a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,6 +28,7 @@ DEPDIRS-librte_ethdev := librte_net librte_eal librte_mempool librte_ring
 DEPDIRS-librte_ethdev += librte_mbuf
 DEPDIRS-librte_ethdev += librte_kvargs
 DEPDIRS-librte_ethdev += librte_meter
+DEPDIRS-librte_ethdev += librte_power
 DIRS-$(CONFIG_RTE_LIBRTE_BBDEV) += librte_bbdev
 DEPDIRS-librte_bbdev := librte_eal librte_mempool librte_mbuf
 DIRS-$(CONFIG_RTE_LIBRTE_CRYPTODEV) += librte_cryptodev
diff --git a/lib/librte_ethdev/Makefile b/lib/librte_ethdev/Makefile
index 47747150b..6a4ce14cf 100644
--- a/lib/librte_ethdev/Makefile
+++ b/lib/librte_ethdev/Makefile
@@ -11,7 +11,7 @@ LIB = librte_ethdev.a
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_net -lrte_eal -lrte_mempool -lrte_ring
-LDLIBS += -lrte_mbuf -lrte_kvargs -lrte_meter -lrte_telemetry
+LDLIBS += -lrte_mbuf -lrte_kvargs -lrte_meter -lrte_telemetry -lrte_power
 
 EXPORT_MAP := rte_ethdev_version.map
 
diff --git a/lib/librte_ethdev/meson.build b/lib/librte_ethdev/meson.build
index 8fc24e8c8..e09e2395e 100644
--- a/lib/librte_ethdev/meson.build
+++ b/lib/librte_ethdev/meson.build
@@ -27,4 +27,4 @@ headers = files('rte_ethdev.h',
 	'rte_tm.h',
 	'rte_tm_driver.h')
 
-deps += ['net', 'kvargs', 'meter', 'telemetry']
+deps += ['net', 'kvargs', 'meter', 'telemetry', 'power']
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 7858ad5f1..b43de88ce 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -16,6 +16,7 @@
 #include <netinet/in.h>
 
 #include <rte_byteorder.h>
+#include <rte_cpuflags.h>
 #include <rte_log.h>
 #include <rte_debug.h>
 #include <rte_interrupts.h>
@@ -39,6 +40,7 @@
 #include <rte_class.h>
 #include <rte_ether.h>
 #include <rte_telemetry.h>
+#include <rte_power.h>
 
 #include "rte_ethdev_trace.h"
 #include "rte_ethdev.h"
@@ -185,6 +187,100 @@ enum {
 	STAT_QMAP_RX
 };
 
+
+static uint16_t
+rte_ethdev_pmgmt_umait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[qidx].num++;
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+				volatile void *target_addr;
+				uint64_t expected, mask;
+				int ret;
+
+				/*
+				 * get address of next descriptor in the RX
+				 * ring for this queue, as well as expected
+				 * value and a mask.
+				 */
+				ret = (*dev->dev_ops->next_rx_desc)
+					(dev->data->rx_queues[qidx],
+					 &target_addr, &expected, &mask);
+				if (ret == 0)
+					/* -1ULL is maximum value for TSC */
+					rte_power_monitor(target_addr,
+							  expected, mask,
+							  0, -1ULL);
+			}
+		} else
+			dev->empty_poll_stats[qidx].num = 0;
+	}
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_ethdev_pmgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+
+			dev->empty_poll_stats[qidx].num++;
+
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+
+				for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+					rte_pause();
+
+			}
+		} else
+			dev->empty_poll_stats[qidx].num = 0;
+	}
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_ethdev_pmgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
+		if (unlikely(nb_rx == 0)) {
+			dev->empty_poll_stats[qidx].num++;
+			if (unlikely(dev->empty_poll_stats[qidx].num >
+					ETH_EMPTYPOLL_MAX)) {
+
+				/* scale down freq */
+				rte_power_freq_min(rte_lcore_id());
+
+			}
+		} else {
+			dev->empty_poll_stats[qidx].num = 0;
+			/* scale up freq */
+			rte_power_freq_max(rte_lcore_id());
+		}
+	}
+
+	return nb_rx;
+}
+
 int
 rte_eth_iterator_init(struct rte_dev_iterator *iter, const char *devargs_str)
 {
@@ -5113,6 +5209,108 @@ rte_eth_dev_pool_ops_supported(uint16_t port_id, const char *pool)
 	return (*dev->dev_ops->pool_ops_supported)(dev, pool);
 }
 
+int
+rte_eth_dev_power_mgmt_enable(unsigned int lcore_id,
+			      uint16_t port_id,
+			 enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+
+	/* allocate zeroed memory for empty poll stats */
+	dev->empty_poll_stats = rte_zmalloc_socket(NULL,
+						   sizeof(struct rte_eth_ep_stat)
+						   * RTE_MAX_QUEUES_PER_PORT,
+						   0, dev->data->numa_node);
+
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	dev->cb_mode = mode;
+
+	switch (mode) {
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT:
+
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_ethdev_pmgmt_umait, NULL);
+		break;
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_ethdev_pmgmt_scalefreq, NULL);
+		break;
+
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_ethdev_pmgmt_pause, NULL);
+		break;
+
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_eth_dev_power_mgmt_disable(unsigned int lcore_id,
+			       uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/*add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)  {
+		/* rte_free ignores NULL so safe to call without checks */
+		rte_free(dev->empty_poll_stats);
+
+		switch (dev->cb_mode) {
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT:
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+
+			rte_eth_remove_rx_callback(port_id, 0,
+						   dev->cur_pwr_cb);
+
+			break;
+
+		case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+
+			rte_power_freq_max(lcore_id);
+
+			rte_eth_remove_rx_callback(port_id, 0,
+						   dev->cur_pwr_cb);
+
+			if (rte_power_exit(lcore_id))
+				return -EINVAL;
+
+			break;
+		}
+
+		dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+
+	}
+	return 0;
+}
+
 /**
  * A set of values to describe the possible states of a switch domain.
  */
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 57e4a6ca5..6858c0338 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1603,6 +1605,25 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** Wait for RX descriptor writes using UMONITOR/UMWAIT. */
+	RTE_ETH_DEV_POWER_MGMT_CB_UMWAIT = 0,
+	/** Back off from busy polling using the PAUSE instruction. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE, /**< Scale core frequency. */
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
@@ -4415,6 +4436,40 @@ __rte_experimental
 int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
 				       struct rte_eth_hairpin_cap *cap);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_eth_dev_power_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 #include <rte_ethdev_core.h>
 
 /**
@@ -4535,6 +4590,7 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 	return nb_rx;
 }
 
+
 /**
  * Get the number of used descriptors of a rx queue
  *
@@ -4993,6 +5049,9 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
 	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
 }
 
+
+
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd41..7d6d85ddc 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,27 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   pointer to a variable that receives the descriptor address.
+ *
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +773,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +791,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +838,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/** Flag indicating the port power state */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint32_t reserved_32;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/** Empty poll number */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[3];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
index 1212a17d3..4d5b63a5b 100644
--- a/lib/librte_ethdev/rte_ethdev_version.map
+++ b/lib/librte_ethdev/rte_ethdev_version.map
@@ -241,6 +241,10 @@ EXPERIMENTAL {
 	__rte_ethdev_trace_rx_burst;
 	__rte_ethdev_trace_tx_burst;
 	rte_flow_get_aged_flows;
+
+	# added in 20.08
+	rte_eth_dev_power_mgmt_disable;
+	rte_eth_dev_power_mgmt_enable;
 };
 
 INTERNAL {
diff --git a/lib/meson.build b/lib/meson.build
index 3852c0156..54cc0db7d 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,17 +14,18 @@ libraries = [
 	'eal', # everything depends on eal
 	'ring',
 	'rcu', # rcu depends on ring
+	'timer',   # eventdev depends on this
+	'power',   # ethdev depends on this
 	'mempool', 'mbuf', 'net', 'meter', 'ethdev', 'pci', # core
 	'cmdline',
 	'metrics', # bitrate/latency stats depends on this
 	'hash',    # efd depends on this
-	'timer',   # eventdev depends on this
 	'acl', 'bbdev', 'bitratestats', 'cfgfile',
 	'compressdev', 'cryptodev',
 	'distributor', 'efd', 'eventdev',
 	'gro', 'gso', 'ip_frag', 'jobstats',
 	'kni', 'latencystats', 'lpm', 'member',
-	'power', 'pdump', 'rawdev', 'regexdev',
+	'pdump', 'rawdev', 'regexdev',
 	'rib', 'reorder', 'sched', 'security', 'stack', 'vhost',
 	# ipsec lib depends on net, crypto and security
 	'ipsec',
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index a54425997..b87abb26e 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -58,7 +58,6 @@ endif
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METRICS)        += --no-whole-archive
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BITRATE)        += -lrte_bitratestats
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LATENCY_STATS)  += -lrte_latencystats
-_LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)          += -lrte_power
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_EFD)            += -lrte_efd
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BPF)            += -lrte_bpf
@@ -80,6 +79,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_KVARGS)         += -lrte_kvargs
 _LDLIBS-y                                   += -lrte_telemetry
 _LDLIBS-$(CONFIG_RTE_LIBRTE_MBUF)           += -lrte_mbuf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_NET)            += -lrte_net
+_LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)          += -lrte_power
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ETHER)          += -lrte_ethdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_BBDEV)          += -lrte_bbdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CRYPTODEV)      += -lrte_cryptodev
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
@ 2020-08-11 10:27   ` Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 4/5] net/i40e: " Liang Ma
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.
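
How the returned address, expected value and mask are meant to be
consumed can be sketched as below. This mirrors the pre-sleep check in
the x86 `rte_power_monitor` implementation earlier in this thread;
`desc_already_written()` is a hypothetical name used only for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the monitored descriptor word
 * already carries the expected bits (e.g. the DD bit), in which case
 * the caller should not go to sleep. A zero mask disables the check.
 */
static bool
desc_already_written(const volatile uint64_t *addr, uint64_t expected,
		uint64_t mask)
{
	if (mask == 0)
		return false;

	return (*addr & mask) == expected;
}
```

With `expected` and `mask` both set to the DD bit, the check returns
true as soon as the NIC has written the descriptor back, so a packet
that arrived just before arming the monitor is not missed.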

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e..618fc1573 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf513..d1d015dea 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b2..826f451be 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [RFC v2 4/5] net/i40e: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API Liang Ma
@ 2020-08-11 10:27   ` " Liang Ma
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 5/5] net/ice: " Liang Ma
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 05d5f2861..f0797c3cb 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -515,6 +515,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c..9d7eea8ae 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160..bfda5b6ad 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [RFC v2 5/5] net/ice: implement power management API
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (2 preceding siblings ...)
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 4/5] net/i40e: " Liang Ma
@ 2020-08-11 10:27   ` " Liang Ma
  2020-08-13 18:04   ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang, Ma
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-08-11 10:27 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 7dd3fcd27..7a636cd11 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index cc3139042..ce7e025b6 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d0..7eb6fa904 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (3 preceding siblings ...)
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 5/5] net/ice: " Liang Ma
@ 2020-08-13 18:04   ` Liang, Ma
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  6 siblings, 0 replies; 56+ messages in thread
From: Liang, Ma @ 2020-08-13 18:04 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

On 11 Aug 11:27, Liang Ma wrote:
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> --- 
<snip> 
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d..94d6a4376 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
>  	RTE_CPUFLAG_EM64T,                  /**< EM64T */
>  
> +	RTE_CPUFLAG_WAITPKG,                 /**< UMONITOR/UMWAIT/TPAUSE */
The order needs to be reworked to avoid breaking the ABI.
>  	/* (EAX 80000007h) EDX features */
>  	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
>  
</snip>  
> -- 
> 2.17.1
> 
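
To illustrate the ABI concern, here is a minimal sketch of why the position of a new enumerator matters. The enum names are hypothetical and much shorter than the real `rte_cpu_flag_t` list; only the ordering effect is the point.

```c
#include <assert.h>

/* Hypothetical flag list as shipped in an earlier release */
enum flags_v1 {
	V1_FLAG_RDTSCP,
	V1_FLAG_EM64T,
	V1_FLAG_INVTSC,
	V1_FLAG_NUMFLAGS
};

/* New flag inserted in the middle: INVTSC silently changes value */
enum flags_inserted {
	INS_FLAG_RDTSCP,
	INS_FLAG_EM64T,
	INS_FLAG_WAITPKG,
	INS_FLAG_INVTSC,
	INS_FLAG_NUMFLAGS
};

/* New flag appended before the terminator: existing values are stable */
enum flags_appended {
	APP_FLAG_RDTSCP,
	APP_FLAG_EM64T,
	APP_FLAG_INVTSC,
	APP_FLAG_WAITPKG,
	APP_FLAG_NUMFLAGS
};
```

A binary built against the old header passes the old numeric value of INVTSC; with the inserted layout that number now means WAITPKG. Appending keeps every existing value intact, although the terminator value still grows.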

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback
  2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
@ 2020-08-13 18:11     ` Liang, Ma
  0 siblings, 0 replies; 56+ messages in thread
From: Liang, Ma @ 2020-08-13 18:11 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

On 11 Aug 11:27, Liang Ma wrote:
<snip>
> +static uint16_t
> +rte_ethdev_pmgmt_umait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[qidx].num++;
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +				volatile void *target_addr;
> +				uint64_t expected, mask;
> +				uint16_t ret;
> +
> +				/*
> +				 * get address of next descriptor in the RX
> +				 * ring for this queue, as well as expected
> +				 * value and a mask.
> +				 */
> +				ret = (*dev->dev_ops->next_rx_desc)
> +					(dev->data->rx_queues[qidx],
> +					 &target_addr, &expected, &mask);
> +				if (ret == 0)
> +					/* -1ULL is maximum value for TSC */
> +					rte_power_monitor(target_addr,
> +							  expected, mask,
> +							  0, -1ULL);
> +			}
> +		} else
> +			dev->empty_poll_stats[qidx].num = 0;
> +	}
> +
> +	return 0;
This should return nb_rx; it is fixed in v3.
> +}
> +
> +static uint16_t
> +rte_ethdev_pmgmt_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	int i;
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +
> +			dev->empty_poll_stats[qidx].num++;
> +
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +
> +				for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> +					rte_pause();
> +
> +			}
> +		} else
> +			dev->empty_poll_stats[qidx].num = 0;
> +	}
> +
> +	return 0;
This should return nb_rx; it is fixed in v3.
> +}
> +
> +static uint16_t
> +rte_ethdev_pmgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED) {
> +		if (unlikely(nb_rx == 0)) {
> +			dev->empty_poll_stats[qidx].num++;
> +			if (unlikely(dev->empty_poll_stats[qidx].num >
> +					ETH_EMPTYPOLL_MAX)) {
> +
> +				/* scale down freq */
> +				rte_power_freq_min(rte_lcore_id());
> +
> +			}
> +		} else {
> +			dev->empty_poll_stats[qidx].num = 0;
> +			/* scale up freq */
> +			rte_power_freq_max(rte_lcore_id());
> +		}
> +	}
> +
> +	return 0;
This should return nb_rx; it is fixed in v3.
> +}
> +
</snip>

 -- 
> 2.17.1
> 
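
For context on why the return value matters: as far as I understand the RX callback chain, whatever a callback returns becomes the packet count seen by the next callback and, finally, by the application. The sketch below models that with a hypothetical, simplified callback type (the real callbacks take more arguments); it is not the ethdev code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical post-RX callback type: receives the burst size and returns
 * the number of packets to hand on to the next stage.
 */
typedef uint16_t (*rx_cb_t)(uint16_t nb_rx);

/* v2 behaviour: always reports an empty burst */
static uint16_t
cb_returns_zero(uint16_t nb_rx)
{
	(void)nb_rx;
	return 0;
}

/* v3 behaviour: passes the burst size through unchanged */
static uint16_t
cb_returns_nb_rx(uint16_t nb_rx)
{
	return nb_rx;
}

/* Run every callback in order, feeding each one the previous result */
static uint16_t
run_rx_callbacks(uint16_t nb_rx, const rx_cb_t *cbs, size_t nb_cbs)
{
	size_t i;

	for (i = 0; i < nb_cbs; i++)
		nb_rx = cbs[i](nb_rx);
	return nb_rx;
}
```

With the v2 callback a burst of 32 received packets is reported to the caller as 0, so the packets are never processed.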

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [RFC PATCH v3 1/6] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (4 preceding siblings ...)
  2020-08-13 18:04   ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang, Ma
@ 2020-09-03 16:06   ` " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
                       ` (4 more replies)
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  6 siblings, 5 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-03 16:06 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
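As a reading aid, the two pieces of plain C logic wrapped around the raw opcodes can be sketched in isolation: the optional masked comparison that aborts the sleep, and the split of the 64-bit TSC deadline into two 32-bit halves. The helper names below are hypothetical and exist only for this sketch; the actual intrinsics are in the diff that follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when the masked value already equals the
 * expected value, i.e. the write we would wait for has already happened.
 */
static bool
monitor_should_abort(uint64_t cur_value, uint64_t expected_value,
		uint64_t value_mask)
{
	/* a zero mask disables the comparison entirely */
	if (value_mask == 0)
		return false;
	return (cur_value & value_mask) == expected_value;
}

/* Hypothetical helper: split a 64-bit TSC deadline into high/low halves */
static void
split_tsc(uint64_t tsc, uint32_t *hi, uint32_t *lo)
{
	*lo = (uint32_t)tsc;
	*hi = (uint32_t)(tsc >> 32);
}
```

Note that the comparison only makes sense if `expected_value` has no bits set outside `value_mask`; otherwise it can never match.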
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   2 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 213 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/* UMONITOR/UMWAIT/TPAUSE instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback Liang Ma
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple API that allows ethdev to get the address of the next
available RX queue descriptor from the PMD.
Also include the internal structure updates.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
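Since `next_rx_desc` is a new, optional driver op, a caller may want to check that the driver actually provides it before dispatching. The sketch below shows one way to do that with a hypothetical, stripped-down ops table; the `demo_*` names and the choice of `-ENOTSUP` are assumptions for illustration, not part of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the eth_next_rx_desc_t typedef added by this patch */
typedef int (*next_rx_desc_fn)(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask);

/* Hypothetical ops table holding only the op of interest */
struct demo_dev_ops {
	next_rx_desc_fn next_rx_desc; /* NULL when the driver lacks support */
};

static volatile uint64_t demo_status_word;

/* Hypothetical driver implementation: always points at one status word */
static int
demo_next_rx_desc(void *rxq, volatile void **addr,
		uint64_t *expected, uint64_t *mask)
{
	(void)rxq;
	*addr = &demo_status_word;
	*expected = 1;
	*mask = 1;
	return 0;
}

/* Dispatch through the ops table, rejecting drivers without the op */
static int
demo_get_wait_target(const struct demo_dev_ops *ops, void *rxq,
		volatile void **addr, uint64_t *expected, uint64_t *mask)
{
	if (ops->next_rx_desc == NULL)
		return -ENOTSUP;
	return ops->next_rx_desc(rxq, addr, expected, mask);
}
```

The result is kept in a signed `int` so that a negative errno value from the driver is not lost.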
 lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
 2 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 70295d7ab7..d9312d3e11 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** WAIT callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..16e54bb4e4 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to the variable that receives the descriptor address.
+ * @param expected
+ *   Pointer to the value expected once the descriptor has been written.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +776,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +841,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/** Flag indicating the port power management state */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	/** Power management callback mode */
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/** Empty poll counters, indexed by RX queue */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[2];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait until
either a TSC timestamp expires or packets arrive.

This API is limited to the 1 core, 1 port, 1 queue use case, as there
is no coordination between queues/cores in ethdev. Mapping 1 port to
multiple cores will be supported in the next version.

This design leverages the RX callback mechanism, which allows three
different power management methods to coexist.

1. UMWAIT/UMONITOR

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. Pause instruction

   Instead of moving the core into a deeper C-state, this lightweight
   method uses the PAUSE instruction to relieve the processor from
   busy polling.

3. Frequency scaling

   Reuse the existing rte_power library to scale the core frequency
   up/down depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
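The empty-poll logic shared by all three callbacks can be sketched on its own, without any ethdev dependency. The `demo_*` names are hypothetical; the threshold mirrors `ETH_EMPTYPOLL_MAX` from the ethdev patch in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_EMPTYPOLL_MAX 512 /* mirrors ETH_EMPTYPOLL_MAX in this series */

/* Hypothetical per-queue counter of consecutive empty polls */
struct demo_ep_stat {
	uint64_t num;
};

/*
 * Update the counter after one RX burst. Returns true when the caller
 * should enter its power-saving action (wait, pause or scale down).
 */
static bool
demo_poll_update(struct demo_ep_stat *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: reset and stay fully awake */
		st->num = 0;
		return false;
	}
	st->num++;
	return st->num > DEMO_EMPTYPOLL_MAX;
}
```

Because the comparison is strictly greater-than, the power-saving action first triggers on the 513th consecutive empty poll, and any non-empty burst resets the count.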
 lib/librte_power/meson.build           |   3 +-
 lib/librte_power/rte_power.h           |  38 +++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
 lib/librte_power/rte_power_version.map |   4 +
 4 files changed, 228 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..44b01afce2 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
 headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
index bbbde4dfb4..06d5a9984f 100644
--- a/lib/librte_power/rte_power.h
+++ b/lib/librte_power/rte_power.h
@@ -14,6 +14,7 @@
 #include <rte_byteorder.h>
 #include <rte_log.h>
 #include <rte_string_fns.h>
+#include <rte_ethdev.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
  */
 int rte_power_exit(unsigned int lcore_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function mode.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 /**
  * Get the available frequencies of a specific lcore.
  * Function pointer definition. Review each environments
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..a445153ede
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+
+#include "rte_power.h"
+
+
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			int ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = (*dev->dev_ops->next_rx_desc)
+				(dev->data->rx_queues[qidx],
+				 &target_addr, &expected, &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr,
+						  expected, mask,
+						  0, -1ULL);
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (unlikely(nb_rx == 0)) {
+
+		dev->empty_poll_stats[qidx].num++;
+
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+				rte_pause();
+
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		dev->empty_poll_stats[qidx].num = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+			uint16_t port_id,
+			enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+	/* allocate zeroed memory for empty poll stats */
+	dev->empty_poll_stats = rte_zmalloc_socket(NULL,
+					sizeof(struct rte_eth_ep_stat)
+					* RTE_MAX_QUEUES_PER_PORT,
+					0, dev->data->numa_node);
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	switch (mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+
+	dev->cb_mode = mode;
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_power_pmd_mgmt_disable(unsigned int lcore_id,
+				uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/*add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
+		return -EINVAL;
+
+	/* rte_free ignores NULL so safe to call without checks */
+	rte_free(dev->empty_poll_stats);
+
+	switch (dev->cb_mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		if (rte_power_exit(lcore_id))
+			return -EINVAL;
+		break;
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	dev->cur_pwr_cb = NULL;
+	dev->cb_mode = 0;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 00ee5753e2..ade83cfd4f 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.08
+	rte_power_pmd_mgmt_disable;
+	rte_power_pmd_mgmt_enable;
+
 };
-- 
2.17.1



* [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-03 16:07     ` Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 6/6] net/ice: " Liang Ma
  4 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.
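
For reviewers, the contract between the driver hook and its caller can be sketched in a self-contained way. This is only an illustration: the descriptor layout, `mock_next_rx_desc`, `mock_desc_done` and `MOCK_STAT_DD` are hypothetical stand-ins rather than the ixgbe definitions, and a little-endian host is assumed so the byte-order conversion is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a write-back RX descriptor. */
struct mock_rx_desc {
	uint64_t pkt_addr;
	volatile uint32_t status_error; /* bit 0 plays the role of DD */
	uint32_t length;
};

#define MOCK_STAT_DD   0x1u
#define MOCK_RING_SIZE 4

struct mock_rx_queue {
	struct mock_rx_desc ring[MOCK_RING_SIZE];
	uint16_t rx_tail; /* next descriptor the driver will read */
};

/* Same shape as the next_rx_desc hook: say where to look and what to expect. */
int
mock_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask)
{
	struct mock_rx_queue *rxq = rx_queue;

	assert(rxq != NULL);
	*tail_desc_addr = &rxq->ring[rxq->rx_tail].status_error;
	*expected = MOCK_STAT_DD;
	*mask = MOCK_STAT_DD;
	return 0;
}

/* Caller side: 1 if the tail descriptor was already written back, else 0. */
int
mock_desc_done(struct mock_rx_queue *rxq)
{
	volatile void *addr;
	uint64_t expected, mask;

	if (mock_next_rx_desc(rxq, &addr, &expected, &mask) != 0)
		return -1;
	/* the mock reads only the 32-bit status word it owns */
	return ((uint64_t)*(volatile uint32_t *)addr & mask) == expected;
}
```

Setting `MOCK_STAT_DD` in the tail descriptor's `status_error` flips the result of `mock_desc_done()` from 0 to 1, which is the condition the monitoring code waits for.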

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e2..618fc15732 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..d1d015deae 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..826f451bee 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1



* [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
                       ` (2 preceding siblings ...)
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
@ 2020-09-03 16:07     ` " Liang Ma
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 6/6] net/ice: " Liang Ma
  4 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 11c02b1888..94e9298d7c 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -517,6 +517,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c1..9d7eea8aed 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..bfda5b6ad3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1



* [dpdk-dev] [RFC PATCH v3 6/6] net/ice: implement power management API
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
                       ` (3 preceding siblings ...)
  2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: " Liang Ma
@ 2020-09-03 16:07     ` " Liang Ma
  4 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-03 16:07 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 8d435e8892..7d7e1dcbac 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 2e1f06d2c0..c043181ceb 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1



* [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
                     ` (5 preceding siblings ...)
  2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
@ 2020-09-04 10:18   ` Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
                       ` (9 more replies)
  6 siblings, 10 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.
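
The optional comparison can be illustrated in isolation. The following is a minimal sketch, not the intrinsic itself: `monitor_should_skip` is a hypothetical helper that only mirrors the masked check done between arming the monitor and going to sleep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 when the sleep should be skipped because the masked value at
 * `p` already equals the expected one, 0 otherwise. A zero mask disables
 * the check, so the caller would always proceed to sleep.
 */
int
monitor_should_skip(const volatile uint64_t *p, uint64_t expected_value,
		uint64_t value_mask)
{
	assert(p != NULL);
	if (value_mask == 0)
		return 0;
	return (*p & value_mask) == expected_value;
}
```

With a mask of 0x1 and an expected value of 0x1, a monitored word of 0xff01 would skip the sleep, while a word of 0 would not.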

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   2 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 213 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..5041a830a7 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -132,6 +132,8 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_MOVDIR64B,              /**< Direct Store Instructions 64B */
 	RTE_CPUFLAG_AVX512VP2INTERSECT,     /**< AVX512 Two Register Intersection */
 
+	/* UMONITOR/UMWAIT/TPAUSE Instructions */
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* The last item */
 	RTE_CPUFLAG_NUMFLAGS,               /**< This should always be the last! */
 };
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on a 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* rflags must match the native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
 
-- 
2.17.1



* [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 16:37       ` Stephen Hemminger
  2020-09-04 20:54       ` Ananyev, Konstantin
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
                       ` (8 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple API that allows ethdev to get the address of the last
available queue descriptor from the PMD.
Also update the internal structures accordingly.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
 2 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 70295d7ab7..d9312d3e11 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -157,6 +157,7 @@ extern "C" {
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_ether.h>
+#include <rte_power_intrinsics.h>
 
 #include "rte_ethdev_trace_fp.h"
 #include "rte_dev_info.h"
@@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
 /** Maximum nb. of vlan per mirror rule */
 #define ETH_MIRROR_MAX_VLANS       64
 
+#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshold */
 #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
 #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
 #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
@@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
 	RTE_ETH_DEV_REMOVED,
 };
 
+#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum rte_eth_dev_power_mgmt_state {
+	/** Device power management is disabled. */
+	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	RTE_ETH_DEV_POWER_MGMT_ENABLED,
+};
+
+enum rte_eth_dev_power_mgmt_cb_mode {
+	/** WAIT callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
+};
+
 struct rte_eth_dev_sriov {
 	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
 	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 32407dd418..16e54bb4e4 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
 	 uint16_t nb_tx_desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 
+/**
+ * @internal
+ * Get the next RX ring descriptor address.
+ *
+ * @param rxq
+ *   ethdev queue pointer.
+ * @param tail_desc_addr
+ *   Pointer to the variable that receives the descriptor address.
+ * @param expected
+ *   Pointer to the value expected once the descriptor has been written.
+ * @param mask
+ *   Pointer to the comparison bitmask for the expected value.
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success.
+ * @retval -EINVAL
+ *   Failed to get descriptor address.
+ */
+typedef int (*eth_next_rx_desc_t)
+	(void *rxq, volatile void **tail_desc_addr,
+	 uint64_t *expected, uint64_t *mask);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -752,6 +776,8 @@ struct eth_dev_ops {
 	/**< Set up device RX hairpin queue. */
 	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
 	/**< Set up device TX hairpin queue. */
+	eth_next_rx_desc_t next_rx_desc;
+	/**< Get next RX ring descriptor address. */
 };
 
 /**
@@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
 	void *param;
 };
 
+/**
+ * @internal
+ * Structure used to hold counters for empty poll
+ */
+struct rte_eth_ep_stat {
+	uint64_t num;
+} __rte_cache_aligned;
+
 /**
  * @internal
  * The generic data structure associated with each ethernet device.
@@ -807,8 +841,16 @@ struct rte_eth_dev {
 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
 	void *security_ctx; /**< Context for security ops */
 
-	uint64_t reserved_64s[4]; /**< Reserved for future fields */
-	void *reserved_ptrs[4];   /**< Reserved for future fields */
+	/** Flag indicating the port power management state */
+	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
+	/** Power management callback mode */
+	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
+	uint64_t reserved_64s[3]; /**< Reserved for future fields */
+
+	/** Per-queue empty poll counters */
+	struct rte_eth_ep_stat *empty_poll_stats;
+	const struct rte_eth_rxtx_callback *cur_pwr_cb;
+	void *reserved_ptrs[2];   /**< Reserved for future fields */
 } __rte_cache_aligned;
 
 struct rte_eth_dev_sriov;
-- 
2.17.1



* [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 16:36       ` Stephen Hemminger
  2020-09-04 18:33       ` Ananyev, Konstantin
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
                       ` (7 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that waits until either
a TSC timestamp expires or packets arrive.
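
The empty-poll counting described above can be sketched without any DPDK dependency. This is a simplified illustration under stated assumptions: `struct ep_stat`, `empty_poll_update` and `EMPTYPOLL_MAX` are hypothetical names that only mirror the per-queue counter and the `ETH_EMPTYPOLL_MAX` threshold from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* mirrors ETH_EMPTYPOLL_MAX from this series */

struct ep_stat {
	uint64_t num; /* consecutive empty polls seen so far */
};

/*
 * Update the counter for one poll result. Returns 1 when the caller
 * should enter the power-optimized state, 0 otherwise.
 */
int
empty_poll_update(struct ep_stat *st, uint16_t nb_rx)
{
	assert(st != NULL);
	if (nb_rx != 0) {
		st->num = 0; /* traffic seen: start counting again */
		return 0;
	}
	st->num++;
	return st->num > EMPTYPOLL_MAX;
}
```

The counter only trips after more than `EMPTYPOLL_MAX` consecutive empty polls, and any non-empty poll resets it, so a busy queue never reaches the sleep path.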

This API is limited to the 1 core, 1 port, 1 queue use case, as there
is no coordination between queues/cores in ethdev. Mapping 1 port to
multiple cores will be supported in the next version.

This design leverages the RX callback mechanism, which allows three
different power management methods to coexist.

1. umwait/umonitor:

   The TSC timestamp is automatically calculated using current
   link speed and RX descriptor ring size, such that the sleep
   time is not longer than it would take for a NIC to fill its
   entire RX descriptor ring.

2. Pause instruction

   Instead of moving the core into a deeper C state, this lightweight
   method uses the Pause instruction to relieve the processor from
   busy polling.

3. Frequency Scaling

   Reuse the existing rte_power library to scale the core frequency
   up or down depending on traffic volume.
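
As a worked example of the timeout calculation mentioned for method 1, here is a rough sketch of how a sleep bound could be derived from link speed and ring size. It is an assumption-laden illustration, not code from this series (the umwait callback in this version passes -1ULL as the timestamp): `ring_fill_tsc_cycles` is a hypothetical helper, it assumes the worst case of minimum-size frames, and the intermediate products are only safe from overflow for ordinary link speeds and TSC frequencies.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum Ethernet frame on the wire: 64 B frame + 8 B preamble + 12 B gap. */
#define MIN_WIRE_BITS (84u * 8u)

/*
 * Upper bound on sleep time, in TSC cycles: how long a link running at
 * link_mbps needs to fill nb_desc descriptors with minimum-size frames.
 */
uint64_t
ring_fill_tsc_cycles(uint32_t link_mbps, uint16_t nb_desc, uint64_t tsc_hz)
{
	uint64_t bits, ns;

	assert(link_mbps != 0);
	bits = (uint64_t)nb_desc * MIN_WIRE_BITS;
	/* bits / (Mbit/s) gives microseconds; scale to ns before dividing */
	ns = bits * 1000u / link_mbps;
	return ns * tsc_hz / 1000000000ull;
}
```

For a 10 Gb/s link, a 512-entry ring and a 2 GHz TSC, this gives roughly 34.4 us, or 68812 cycles, which would then be added to the current TSC value to form the deadline.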

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_power/meson.build           |   3 +-
 lib/librte_power/rte_power.h           |  38 +++++
 lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
 lib/librte_power/rte_power_version.map |   4 +
 4 files changed, 228 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c

diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..44b01afce2 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
 headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
index bbbde4dfb4..06d5a9984f 100644
--- a/lib/librte_power/rte_power.h
+++ b/lib/librte_power/rte_power.h
@@ -14,6 +14,7 @@
 #include <rte_byteorder.h>
 #include <rte_log.h>
 #include <rte_string_fns.h>
+#include <rte_ethdev.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
  */
 int rte_power_exit(unsigned int lcore_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function mode.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+				  uint16_t port_id,
+				  enum rte_eth_dev_power_mgmt_cb_mode mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Disable device power management.
+ * @param lcore_id
+ *   lcore id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
+
 /**
  * Get the available frequencies of a specific lcore.
  * Function pointer definition. Review each environments
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..a445153ede
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+
+#include "rte_power.h"
+
+
+
+static uint16_t
+rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			int ret;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = (*dev->dev_ops->next_rx_desc)
+				(dev->data->rx_queues[qidx],
+				 &target_addr, &expected, &mask);
+			if (ret == 0)
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor(target_addr,
+						  expected, mask,
+						  0, -1ULL);
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	int i;
+
+	if (unlikely(nb_rx == 0)) {
+
+		dev->empty_poll_stats[qidx].num++;
+
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
+				rte_pause();
+
+		}
+	} else
+		dev->empty_poll_stats[qidx].num = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(nb_rx == 0)) {
+		dev->empty_poll_stats[qidx].num++;
+		if (unlikely(dev->empty_poll_stats[qidx].num >
+			     ETH_EMPTYPOLL_MAX)) {
+
+			/*scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+
+		}
+	} else {
+		dev->empty_poll_stats[qidx].num = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_enable(unsigned int lcore_id,
+			uint16_t port_id,
+			enum rte_eth_dev_power_mgmt_cb_mode mode)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
+		return -EINVAL;
+	/* allocate memory for empty poll stats */
+	dev->empty_poll_stats = rte_malloc_socket(NULL,
+						  sizeof(struct rte_eth_ep_stat)
+						  * RTE_MAX_QUEUES_PER_PORT,
+						  0, dev->data->numa_node);
+	if (dev->empty_poll_stats == NULL)
+		return -ENOMEM;
+
+	switch (mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
+			return -ENOTSUP;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_umwait, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		/* init scale freq */
+		if (rte_power_init(lcore_id))
+			return -EINVAL;
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+					rte_power_mgmt_scalefreq, NULL);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
+						rte_power_mgmt_pause, NULL);
+		break;
+	}
+
+	dev->cb_mode = mode;
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
+	return 0;
+}
+
+int
+rte_power_pmd_mgmt_disable(unsigned int lcore_id,
+				uint16_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/*add flag check */
+
+	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
+		return -EINVAL;
+
+	/* rte_free ignores NULL so safe to call without checks */
+	rte_free(dev->empty_poll_stats);
+
+	switch (dev->cb_mode) {
+	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
+	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		break;
+	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, 0,
+					   dev->cur_pwr_cb);
+		if (rte_power_exit(lcore_id))
+			return -EINVAL;
+		break;
+	}
+
+	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
+	dev->cur_pwr_cb = NULL;
+	dev->cb_mode = 0;
+
+	return 0;
+}
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 00ee5753e2..ade83cfd4f 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.08
+	rte_power_pmd_mgmt_disable;
+	rte_power_pmd_mgmt_enable;
+
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-04 10:18     ` Liang Ma
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 5/6] net/i40e: " Liang Ma
                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 22 ++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index fd0cb9b0e2..618fc15732 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -592,6 +592,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.udp_tunnel_port_del  = ixgbe_dev_udp_tunnel_port_del,
 	.tm_ops_get           = ixgbe_tm_ops_get,
 	.tx_done_cleanup      = ixgbe_dev_tx_done_cleanup,
+	.next_rx_desc         = ixgbe_next_rx_desc,
 };
 
 /*
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 977ecf5137..d1d015deae 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,28 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.upper.status_error;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	*mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+
+	return 0;
+}
+
 /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */
 static inline uint32_t
 ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 7e09291b22..826f451bee 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev);
 uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev);
+int ixgbe_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _IXGBE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [PATCH v3 5/6] net/i40e: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (2 preceding siblings ...)
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
@ 2020-09-04 10:18     ` " Liang Ma
  2020-09-04 10:19     ` [dpdk-dev] [PATCH v3 6/6] net/ice: " Liang Ma
                       ` (5 subsequent siblings)
  9 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-04 10:18 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/i40e/i40e_ethdev.c |  1 +
 drivers/net/i40e/i40e_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 11c02b1888..94e9298d7c 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -517,6 +517,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.mtu_set                      = i40e_dev_mtu_set,
 	.tm_ops_get                   = i40e_tm_ops_get,
 	.tx_done_cleanup              = i40e_tx_done_cleanup,
+	.next_rx_desc	              = i40e_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index fe7f9200c1..9d7eea8aed 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -71,6 +71,29 @@
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+int
+i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.qword1.status_error_len;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	*mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+
+	return 0;
+}
+
 static inline void
 i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp)
 {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 57d7b4160b..bfda5b6ad3 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue,
 	struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_pkts);
+int i40e_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 /* For each value it means, datasheet of hardware can tell more details
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [dpdk-dev] [PATCH v3 6/6] net/ice: implement power management API
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (3 preceding siblings ...)
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 5/6] net/i40e: " Liang Ma
@ 2020-09-04 10:19     ` " Liang Ma
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
                       ` (4 subsequent siblings)
  9 siblings, 0 replies; 56+ messages in thread
From: Liang Ma @ 2020-09-04 10:19 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, anatoly.burakov, Liang Ma

Implement support for the power management API by implementing a
`next_rx_desc` function that will return an address of an RX ring's
status bit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ice/ice_ethdev.c |  1 +
 drivers/net/ice/ice_rxtx.c   | 23 +++++++++++++++++++++++
 drivers/net/ice/ice_rxtx.h   |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 8d435e8892..7d7e1dcbac 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -212,6 +212,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
 	.udp_tunnel_port_add          = ice_dev_udp_tunnel_port_add,
 	.udp_tunnel_port_del          = ice_dev_udp_tunnel_port_del,
 	.tx_done_cleanup              = ice_tx_done_cleanup,
+	.next_rx_desc	              = ice_next_rx_desc,
 };
 
 /* store statistics names and its offset in stats structure */
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 2e1f06d2c0..c043181ceb 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -24,6 +24,29 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask)
+{
+	volatile union ice_rx_flex_desc *rxdp;
+	struct ice_rx_queue *rxq = rx_queue;
+	uint16_t desc;
+
+	desc = rxq->rx_tail;
+	rxdp = &rxq->rx_ring[desc];
+	/* watch for changes in status bit */
+	*tail_desc_addr = &rxdp->wb.status_error0;
+
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	*expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	*mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+
+	return 0;
+}
+
+
 static inline uint64_t
 ice_rxdid_to_proto_xtr_ol_flag(uint8_t rxdid)
 {
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index 2fdcfb7d04..7eb6fa904e 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -202,5 +202,7 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
 				uint16_t nb_pkts);
 int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
 int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
+int ice_next_rx_desc(void *rx_queue, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask);
 
 #endif /* _ICE_RXTX_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (4 preceding siblings ...)
  2020-09-04 10:19     ` [dpdk-dev] [PATCH v3 6/6] net/ice: " Liang Ma
@ 2020-09-04 16:23     ` Stephen Hemminger
  2020-09-14 20:48       ` Liang, Ma
  2020-09-04 16:37     ` Stephen Hemminger
                       ` (3 subsequent siblings)
  9 siblings, 1 reply; 56+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:23 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);

Since this is generic code and you are defining the function,
you should have it return -ENOTSUP or -EINVAL.
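
A minimal sketch of what that could look like for a generic fallback, for
illustration only. The name power_monitor_generic is hypothetical and not
part of the patch; it only mirrors the parameter list of rte_power_monitor()
quoted above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical generic fallback: architectures without a monitor/wait
 * primitive report the missing support explicitly instead of returning
 * an architecture-dependent value.
 */
static inline int
power_monitor_generic(const volatile void *p, uint64_t expected_value,
		uint64_t value_mask, uint32_t state, uint64_t tsc_timestamp)
{
	(void)expected_value;
	(void)value_mask;
	(void)state;
	(void)tsc_timestamp;

	if (p == NULL)
		return -EINVAL;	/* nothing to monitor */

	return -ENOTSUP;	/* no hardware support on this architecture */
}
```

Callers can then tell "not supported here" apart from "bad argument" and
fall back to plain polling.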

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
@ 2020-09-04 16:36       ` Stephen Hemminger
  2020-09-14 20:52         ` Liang, Ma
  2020-09-04 18:33       ` Ananyev, Konstantin
  1 sibling, 1 reply; 56+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:36 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:57 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 port 1 queue use case as there is
> no coordination between queues/cores in ethdev. 1 port map to multiple
> core will be supported in next version.

The common way to express this is:

This API is not thread-safe and not preempt-safe.
There is also no mechanism for a single thread to wait on multiple queues.

> 
> This design leverage RX Callback mechnaism which allow three
> different power management methodology co exist.

nit: "coexist" is one word

> 
> 1. umwait/umonitor:
> 
>    The TSC timestamp is automatically calculated using current
>    link speed and RX descriptor ring size, such that the sleep
>    time is not longer than it would take for a NIC to fill its
>    entire RX descriptor ring.
> 
> 2. Pause instruction
> 
>    Instead of move the core into deeper C state, this lightweight
>    method use Pause instruction to releaf the processor from
>    busy polling.

Wording here is a problem, and "releaf" should be "relieve"?
Rewording into the active voice would be easier.

     Use Pause instruction to allow processor to go into deeper C
     state when busy polling.



> 
> 3. Frequency Scaling
>    Reuse exist rte power library to scale up/down core frequency
>    depend on traffic volume.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
@ 2020-09-04 16:37       ` Stephen Hemminger
  2020-09-14 21:04         ` Liang, Ma
  2020-09-04 20:54       ` Ananyev, Konstantin
  1 sibling, 1 reply; 56+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:37 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:56 +0100
Liang Ma <liang.j.ma@intel.com> wrote:



> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */

Spelling here.

Also, shouldn't this be a per-device (or per-queue) configuration value?
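
One possible shape for a per-queue threshold, as a rough sketch. The struct
and function names below are hypothetical and do not appear in the patch;
the counting logic is the same empty-poll scheme, with the limit stored next
to the counter instead of in a global #define.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue state; none of these names come from the patch. */
struct queue_pm_cfg {
	uint32_t empty_poll_max;	/* threshold, configurable per queue */
	uint32_t empty_polls;		/* consecutive empty polls seen so far */
};

/*
 * Record the result of one RX burst. Returns true once the queue has
 * been idle for more than empty_poll_max consecutive polls.
 */
static inline bool
queue_pm_update(struct queue_pm_cfg *cfg, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		cfg->empty_polls = 0;	/* traffic seen, start over */
		return false;
	}

	/* saturate just above the threshold so the counter cannot wrap */
	if (cfg->empty_polls <= cfg->empty_poll_max)
		cfg->empty_polls++;

	return cfg->empty_polls > cfg->empty_poll_max;
}
```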

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (5 preceding siblings ...)
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
@ 2020-09-04 16:37     ` Stephen Hemminger
  2020-09-14 20:49       ` Liang, Ma
  2020-09-04 18:42     ` Stephen Hemminger
                       ` (2 subsequent siblings)
  9 siblings, 1 reply; 56+ messages in thread
From: Stephen Hemminger @ 2020-09-04 16:37 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

This looks like a useful feature but needs more documentation and an example.
It would make sense to put an example in l3fwd-power. 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
  2020-09-04 16:36       ` Stephen Hemminger
@ 2020-09-04 18:33       ` Ananyev, Konstantin
  2020-09-14 21:01         ` Liang, Ma
  1 sibling, 1 reply; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-09-04 18:33 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J

> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API is limited to 1 core 1 port 1 queue use case as there is
> no coordination between queues/cores in ethdev. 1 port map to multiple
> core will be supported in next version.
> 
> This design leverage RX Callback mechnaism which allow three
> different power management methodology co exist.
> 
> 1. umwait/umonitor:
> 
>    The TSC timestamp is automatically calculated using current
>    link speed and RX descriptor ring size, such that the sleep
>    time is not longer than it would take for a NIC to fill its
>    entire RX descriptor ring.
> 
> 2. Pause instruction
> 
>    Instead of move the core into deeper C state, this lightweight
>    method use Pause instruction to releaf the processor from
>    busy polling.
> 
> 3. Frequency Scaling
>    Reuse exist rte power library to scale up/down core frequency
>    depend on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_power/meson.build           |   3 +-
>  lib/librte_power/rte_power.h           |  38 +++++
>  lib/librte_power/rte_power_pmd_mgmt.c  | 184 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_version.map |   4 +
>  4 files changed, 228 insertions(+), 1 deletion(-)
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
> 
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 78c031c943..44b01afce2 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>  		'power_kvm_vm.c', 'guest_channel.c',
>  		'rte_power_empty_poll.c',
>  		'power_pstate_cpufreq.c',
> +		'rte_power_pmd_mgmt.c',
>  		'power_common.c')
>  headers = files('rte_power.h','rte_power_empty_poll.h')
> -deps += ['timer']
> +deps += ['timer' ,'ethdev']
> diff --git a/lib/librte_power/rte_power.h b/lib/librte_power/rte_power.h
> index bbbde4dfb4..06d5a9984f 100644
> --- a/lib/librte_power/rte_power.h
> +++ b/lib/librte_power/rte_power.h
> @@ -14,6 +14,7 @@
>  #include <rte_byteorder.h>
>  #include <rte_log.h>
>  #include <rte_string_fns.h>
> +#include <rte_ethdev.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
> @@ -97,6 +98,43 @@ int rte_power_init(unsigned int lcore_id);
>   */
>  int rte_power_exit(unsigned int lcore_id);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable device power management.
> + * @param lcore_id
> + *   lcore id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param mode
> + *   The power management callback function mode.
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> +				  uint16_t port_id,
> +				  enum rte_eth_dev_power_mgmt_cb_mode mode);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Disable device power management.
> + * @param lcore_id
> + *   lcore id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int rte_power_pmd_mgmt_disable(unsigned int lcore_id, uint16_t port_id);
> +
>  /**
>   * Get the available frequencies of a specific lcore.
>   * Function pointer definition. Review each environments
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> new file mode 100644
> index 0000000000..a445153ede
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> @@ -0,0 +1,184 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +#include <rte_ethdev.h>
> +
> +#include "rte_power.h"
> +
> +
> +
> +static uint16_t
> +rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		dev->empty_poll_stats[qidx].num++;

I believe there are two fundamental issues with that approach:
1. You put metadata specific to the power library's callbacks into the rte_eth_dev struct.
2. These callbacks access rte_eth_devices[] directly.
That doesn't look right to me - the rte_eth_dev structure is supposed to be treated
as internal to librte_ethdev and the underlying drivers, and should not be accessed
directly by outside code.
If these callbacks need some extra metadata, then it is the responsibility
of the power library to allocate and manage that metadata.
You can pass a pointer to this metadata via the last parameter of rte_eth_add_rx_callback().
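
A rough sketch of that idea, under stated assumptions. The struct
pmd_pm_queue_state and the function pm_rx_callback are hypothetical names;
the function only mirrors the parameter list of the callbacks in the patch.
rte_mbuf is forward-declared so the fragment stands alone, and the sleep
primitive is replaced by a counter.

```c
#include <stddef.h>
#include <stdint.h>

struct rte_mbuf;	/* opaque here; only pointers are passed around */

/* Hypothetical metadata owned by the power library, one per queue. */
struct pmd_pm_queue_state {
	uint64_t empty_polls;	/* consecutive empty polls */
	uint64_t threshold;	/* sleep once this is exceeded */
	uint64_t sleeps;	/* how often the sleep path was taken */
};

/*
 * Same shape as the RX callbacks in the patch. All state arrives through
 * user_param, so rte_eth_devices[] is never touched.
 */
static uint16_t
pm_rx_callback(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts,
		uint16_t nb_rx, uint16_t max_pkts, void *user_param)
{
	struct pmd_pm_queue_state *st = user_param;

	(void)port_id;
	(void)qidx;
	(void)pkts;
	(void)max_pkts;

	if (nb_rx != 0) {
		st->empty_polls = 0;
		return nb_rx;
	}
	if (++st->empty_polls > st->threshold)
		st->sleeps++;	/* placeholder for the real sleep primitive */

	return nb_rx;
}
```

The pointer to the state would be handed over as the last argument of
rte_eth_add_rx_callback() in place of NULL.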

> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +			volatile void *target_addr;
> +			uint64_t expected, mask;
> +			uint16_t ret;
> +
> +			/*
> +			 * get address of next descriptor in the RX
> +			 * ring for this queue, as well as expected
> +			 * value and a mask.
> +			 */
> +			ret = (*dev->dev_ops->next_rx_desc)
> +				(dev->data->rx_queues[qidx],
> +				 &target_addr, &expected, &mask);
> +			if (ret == 0)
> +				/* -1ULL is maximum value for TSC */
> +				rte_power_monitor(target_addr,
> +						  expected, mask,
> +						  0, -1ULL);
> +		}
> +	} else
> +		dev->empty_poll_stats[qidx].num = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	int i;
> +
> +	if (unlikely(nb_rx == 0)) {
> +
> +		dev->empty_poll_stats[qidx].num++;
> +
> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +
> +			for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> +				rte_pause();
> +
> +		}
> +	} else
> +		dev->empty_poll_stats[qidx].num = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		dev->empty_poll_stats[qidx].num++;
> +		if (unlikely(dev->empty_poll_stats[qidx].num >
> +			     ETH_EMPTYPOLL_MAX)) {
> +
> +			/*scale down freq */
> +			rte_power_freq_min(rte_lcore_id());
> +
> +		}
> +	} else {
> +		dev->empty_poll_stats[qidx].num = 0;
> +		/* scal up freq */
> +		rte_power_freq_max(rte_lcore_id());
> +	}
> +
> +	return nb_rx;
> +}
> +
> +int
> +rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> +			uint16_t port_id,
> +			enum rte_eth_dev_power_mgmt_cb_mode mode)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
> +		return -EINVAL;
> +	/* allocate memory for empty poll stats */
> +	dev->empty_poll_stats = rte_malloc_socket(NULL,
> +						  sizeof(struct rte_eth_ep_stat)
> +						  * RTE_MAX_QUEUES_PER_PORT,
> +						  0, dev->data->numa_node);
> +	if (dev->empty_poll_stats == NULL)
> +		return -ENOMEM;
> +
> +	switch (mode) {
> +	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> +		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> +			return -ENOTSUP;

Here and in other places: in case of an error return you don't free your empty_poll_stats.
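
A minimal sketch of the usual fix, a single cleanup label. It uses plain
calloc()/free() and hypothetical names (struct pm_port, pm_port_enable) so
that it stands alone; the hw_supported flag stands in for the
rte_cpu_get_flag_enabled() check in the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-port state the patch allocates. */
struct pm_port {
	unsigned int *empty_poll_stats;
	bool enabled;
};

/*
 * Enable power management. Every failure after the allocation goes
 * through the same cleanup label, so the stats array is never leaked.
 */
static int
pm_port_enable(struct pm_port *port, size_t nb_queues, bool hw_supported)
{
	int ret;

	if (port->enabled)
		return -EINVAL;

	port->empty_poll_stats = calloc(nb_queues,
			sizeof(*port->empty_poll_stats));
	if (port->empty_poll_stats == NULL)
		return -ENOMEM;

	if (!hw_supported) {
		ret = -ENOTSUP;
		goto err_free;
	}

	port->enabled = true;
	return 0;

err_free:
	free(port->empty_poll_stats);
	port->empty_poll_stats = NULL;
	return ret;
}
```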

> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,

Why zero for the queue number? Why not pass queue_id as a parameter to that function?

> +						rte_power_mgmt_umwait, NULL);

As I said above, instead of NULL this could be a pointer to a metadata struct.

> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> +		/* init scale freq */
> +		if (rte_power_init(lcore_id))
> +			return -EINVAL;
> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> +					rte_power_mgmt_scalefreq, NULL);
> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> +		dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> +						rte_power_mgmt_pause, NULL);
> +		break;
> +	}
> +
> +	dev->cb_mode = mode;
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> +	return 0;
> +}
> +
> +int
> +rte_power_pmd_mgmt_disable(unsigned int lcore_id,
> +				uint16_t port_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/*add flag check */
> +
> +	if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
> +		return -EINVAL;
> +
> +	/* rte_free ignores NULL so safe to call without checks */
> +	rte_free(dev->empty_poll_stats);

You can't free callback metadata before removing the callback itself.
In fact, with current rx callback code it is not safe to free it
even after (we discussed it offline).
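
The intended ordering, as a sketch only. The names are hypothetical, a bool
stands in for the ethdev callback handle, and the step that waits for
in-flight callback invocations is marked but not implemented.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical state; names are illustrative, not from the patch. */
struct pm_queue {
	bool cb_registered;	/* stands in for the ethdev callback handle */
	unsigned int *stats;	/* metadata the callback dereferences */
};

/* Detach first, so no new invocation can start with a pointer to stats. */
static void
pm_queue_disable(struct pm_queue *q)
{
	q->cb_registered = false;	/* 1. remove the callback */

	/*
	 * 2. A real implementation must also wait here until no data-path
	 * thread is still inside the callback; that step is omitted.
	 */

	free(q->stats);			/* 3. only now release the metadata */
	q->stats = NULL;
}
```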

> +
> +	switch (dev->cb_mode) {
> +	case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> +	case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, 0,
> +					   dev->cur_pwr_cb);
> +		break;
> +	case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> +		rte_power_freq_max(lcore_id);

Stupid q: what makes you think that the lcore frequency was at max
*before* you set up the callback?

> +		rte_eth_remove_rx_callback(port_id, 0,
> +					   dev->cur_pwr_cb);
> +		if (rte_power_exit(lcore_id))
> +			return -EINVAL;
> +		break;
> +	}
> +
> +	dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> +	dev->cur_pwr_cb = NULL;
> +	dev->cb_mode = 0;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index 00ee5753e2..ade83cfd4f 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -34,4 +34,8 @@ EXPERIMENTAL {
>  	rte_power_guest_channel_receive_msg;
>  	rte_power_poll_stat_fetch;
>  	rte_power_poll_stat_update;
> +	# added in 20.08
> +	rte_power_pmd_mgmt_disable;
> +	rte_power_pmd_mgmt_enable;
> +
>  };
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (6 preceding siblings ...)
  2020-09-04 16:37     ` Stephen Hemminger
@ 2020-09-04 18:42     ` Stephen Hemminger
  2020-09-14 21:12       ` Liang, Ma
  2020-09-16 16:34       ` Liang, Ma
  2020-09-06 21:44     ` Ananyev, Konstantin
  2020-09-18  5:01     ` Jerin Jacob
  9 siblings, 2 replies; 56+ messages in thread
From: Stephen Hemminger @ 2020-09-04 18:42 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, anatoly.burakov

On Fri,  4 Sep 2020 11:18:55 +0100
Liang Ma <liang.j.ma@intel.com> wrote:

> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

Before this is merged, please work with Arm maintainers to have a version that
works on Arm 64 as well. Don't think this should be merged unless the two major
platforms supported by DPDK can work with it.

Also, not sure if this mechanism can work with other drivers. You need to
work with other vendors to show that the same infrastructure can work with
their hardware. Once again, I don't think this can go in if it only can
work on Intel.  It needs to work on Broadcom, Mellanox to be useful.

Will it work in a VM? Will it work with virtio or vmxnet3?

Having a single vendor solution is a non-starter for me.
They don't all have to be there to get it merged, but if the design only
works on a single platform then it is not helpful.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
  2020-09-04 16:37       ` Stephen Hemminger
@ 2020-09-04 20:54       ` Ananyev, Konstantin
  1 sibling, 0 replies; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-09-04 20:54 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J

> Add a simple API allow ethdev get the last
> available queue descriptor address from PMD.
> Also include internal structure update.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_ethdev/rte_ethdev.h      | 22 ++++++++++++++
>  lib/librte_ethdev/rte_ethdev_core.h | 46 +++++++++++++++++++++++++++--
>  2 files changed, 66 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 70295d7ab7..d9312d3e11 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -157,6 +157,7 @@ extern "C" {
>  #include <rte_common.h>
>  #include <rte_config.h>
>  #include <rte_ether.h>
> +#include <rte_power_intrinsics.h>
> 
>  #include "rte_ethdev_trace_fp.h"
>  #include "rte_dev_info.h"
> @@ -775,6 +776,7 @@ rte_eth_rss_hf_refine(uint64_t rss_hf)
>  /** Maximum nb. of vlan per mirror rule */
>  #define ETH_MIRROR_MAX_VLANS       64
> 
> +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
>  #define ETH_MIRROR_VIRTUAL_POOL_UP     0x01  /**< Virtual Pool uplink Mirroring. */
>  #define ETH_MIRROR_UPLINK_PORT         0x02  /**< Uplink Port Mirroring. */
>  #define ETH_MIRROR_DOWNLINK_PORT       0x04  /**< Downlink Port Mirroring. */
> @@ -1602,6 +1604,26 @@ enum rte_eth_dev_state {
>  	RTE_ETH_DEV_REMOVED,
>  };
> 
> +#define  RTE_ETH_PAUSE_NUM  64    /* How many times to pause */
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum rte_eth_dev_power_mgmt_state {
> +	/** Device power management is disabled. */
> +	RTE_ETH_DEV_POWER_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	RTE_ETH_DEV_POWER_MGMT_ENABLED,
> +};
> +
> +enum rte_eth_dev_power_mgmt_cb_mode {
> +	/** WAIT callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_WAIT = 1,
> +	/** PAUSE callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_PAUSE,
> +	/** Freq Scaling callback mode. */
> +	RTE_ETH_DEV_POWER_MGMT_CB_SCALE,
> +};
> +

I don't think we need to put all this power-related
stuff into the rte_ethdev library.
rte_power, or something similar, seems like a much better place for it.

>  struct rte_eth_dev_sriov {
>  	uint8_t active;               /**< SRIOV is active with 16, 32 or 64 pools */
>  	uint8_t nb_q_per_pool;        /**< rx queue number per pool */
> diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
> index 32407dd418..16e54bb4e4 100644
> --- a/lib/librte_ethdev/rte_ethdev_core.h
> +++ b/lib/librte_ethdev/rte_ethdev_core.h
> @@ -603,6 +603,30 @@ typedef int (*eth_tx_hairpin_queue_setup_t)
>  	 uint16_t nb_tx_desc,
>  	 const struct rte_eth_hairpin_conf *hairpin_conf);
> 
> +/**
> + * @internal
> + * Get the next RX ring descriptor address.
> + *
> + * @param rxq
> + *   ethdev queue pointer.
> + * @param tail_desc_addr
> + *   the pointer point to descriptor address var.
> + * @param expected
> + *   the pointer point to value to be expected when descriptor is set.
> + * @param mask
> + *   the pointer point to comparison bitmask for the expected value.
> + * @return
> + *   Negative errno value on error, 0 on success.
> + *
> + * @retval 0
> + *   Success.
> + * @retval -EINVAL
> + *   Failed to get descriptor address.
> + */
> +typedef int (*eth_next_rx_desc_t)
> +	(void *rxq, volatile void **tail_desc_addr,
> +	 uint64_t *expected, uint64_t *mask);
> +

In theory it could be anything: next RXD, doorbell,
even some global variable.
So I think function name needs to be more neutral:
eth_rx_wait_addr() or so.
Also I think you need a new rte_eth_ wrapper function
for that dev op.  
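
To make the suggestion a bit more concrete, here is a rough stand-alone
sketch of a neutrally named dev op plus a public wrapper. Everything in it
is a placeholder for illustration: struct eth_dev_sketch,
eth_rx_wait_addr_t, eth_rx_wait_addr_get() and demo_rx_wait_addr() are not
existing ethdev API, and a real wrapper would validate port_id with the
usual ethdev macros instead of taking a device pointer.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dev op: returns whatever address the PMD wants us to wait on
 * (next RXD, doorbell, some global variable, ...). */
typedef int (*eth_rx_wait_addr_t)(void *rxq, volatile void **wait_addr,
		uint64_t *expected, uint64_t *mask);

/* Stand-in for the relevant bits of struct rte_eth_dev (illustration only). */
struct eth_dev_sketch {
	eth_rx_wait_addr_t rx_wait_addr; /* NULL if the PMD lacks support */
	void **rx_queues;
	uint16_t nb_rx_queues;
};

/* Hypothetical public wrapper, so callers never touch dev ops directly. */
static int
eth_rx_wait_addr_get(struct eth_dev_sketch *dev, uint16_t queue_id,
		volatile void **wait_addr, uint64_t *expected, uint64_t *mask)
{
	if (dev == NULL || wait_addr == NULL || expected == NULL || mask == NULL)
		return -EINVAL;
	if (queue_id >= dev->nb_rx_queues)
		return -EINVAL;
	if (dev->rx_wait_addr == NULL)
		return -ENOTSUP;
	return dev->rx_wait_addr(dev->rx_queues[queue_id], wait_addr,
			expected, mask);
}

/* Toy driver op: the "queue" is a single 64-bit descriptor word and
 * bit 0 plays the role of a "descriptor done" flag. */
static int
demo_rx_wait_addr(void *rxq, volatile void **wait_addr,
		uint64_t *expected, uint64_t *mask)
{
	*wait_addr = rxq;
	*expected = 1;
	*mask = 1;
	return 0;
}
```

With a wrapper like this, the power library would only ever see an
address, an expected value and a mask, and would not need to know what
the PMD is actually monitoring.
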

>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -752,6 +776,8 @@ struct eth_dev_ops {
>  	/**< Set up device RX hairpin queue. */
>  	eth_tx_hairpin_queue_setup_t tx_hairpin_queue_setup;
>  	/**< Set up device TX hairpin queue. */
> +	eth_next_rx_desc_t next_rx_desc;
> +	/**< Get next RX ring descriptor address. */
>  };
> 
>  /**
> @@ -768,6 +794,14 @@ struct rte_eth_rxtx_callback {
>  	void *param;
>  };
> 
> +/**
> + * @internal
> + * Structure used to hold counters for empty poll
> + */
> +struct rte_eth_ep_stat {
> +	uint64_t num;
> +} __rte_cache_aligned;
> +
>  /**
>   * @internal
>   * The generic data structure associated with each ethernet device.
> @@ -807,8 +841,16 @@ struct rte_eth_dev {
>  	enum rte_eth_dev_state state; /**< Flag indicating the port state */
>  	void *security_ctx; /**< Context for security ops */
> 
> -	uint64_t reserved_64s[4]; /**< Reserved for future fields */
> -	void *reserved_ptrs[4];   /**< Reserved for future fields */
> +	/**< Empty poll number */
> +	enum rte_eth_dev_power_mgmt_state pwr_mgmt_state;
> +	/**< Power mgmt Callback mode */
> +	enum rte_eth_dev_power_mgmt_cb_mode cb_mode;
> +	uint64_t reserved_64s[3]; /**< Reserved for future fields */
> +
> +	/**< Flag indicating the port power state */
> +	struct rte_eth_ep_stat *empty_poll_stats;
> +	const struct rte_eth_rxtx_callback *cur_pwr_cb;
> +	void *reserved_ptrs[2];   /**< Reserved for future fields */
>  } __rte_cache_aligned;
> 
>  struct rte_eth_dev_sriov;
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (7 preceding siblings ...)
  2020-09-04 18:42     ` Stephen Hemminger
@ 2020-09-06 21:44     ` Ananyev, Konstantin
  2020-09-18  5:01     ` Jerin Jacob
  9 siblings, 0 replies; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-09-06 21:44 UTC (permalink / raw)
  To: Ma, Liang J, dev; +Cc: Hunt, David, Burakov, Anatoly, Ma, Liang J


> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..6dd1cdc939
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_

As a nit - if the functions below are supported for both the 64-bit and 32-bit ISA,
then the guard should probably be: RTE_POWER_INTRINSIC_X86_H_


> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
@ 2020-09-14 20:48       ` Liang, Ma
  0 siblings, 0 replies; 56+ messages in thread
From: Liang, Ma @ 2020-09-14 20:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

Hi Stephen, 
   Agree. v4 will address this.
Regards
Liang
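
For reference, a generic fallback of the kind requested in the review
quoted below could look roughly like this. It is only a sketch: the name
power_monitor_generic() is made up here, and it returns -ENOTSUP (the
standard errno spelling) rather than the -ENOTSUPPORTED written in the
comment.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generic stub; same parameter list as the quoted prototype. */
static inline int
power_monitor_generic(const volatile void *p, const uint64_t expected_value,
		const uint64_t value_mask, const uint32_t state,
		const uint64_t tsc_timestamp)
{
	/* Nothing to do on an architecture without a monitor/wait primitive. */
	(void)p;
	(void)expected_value;
	(void)value_mask;
	(void)state;
	(void)tsc_timestamp;

	return -ENOTSUP;
}
```

Callers can then treat any negative return as "feature unavailable" and
fall back to plain polling.
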
On 04 Sep 09:23, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:55 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> > + *
> > + * @return
> > + *   Architecture-dependent return value.
> > + */
> > +static inline int rte_power_monitor(const volatile void *p,
> > +		const uint64_t expected_value, const uint64_t value_mask,
> > +		const uint32_t state, const uint64_t tsc_timestamp);
> 
> Since this is generic code, and you are defining the function.
> You should have it return -ENOTSUPPORTED or -EINVAL.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 16:37     ` Stephen Hemminger
@ 2020-09-14 20:49       ` Liang, Ma
  0 siblings, 0 replies; 56+ messages in thread
From: Liang, Ma @ 2020-09-14 20:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

Hi Stephen, 
   The v4 patch will include the l3fwd-power update.
Regards
Liang

On 04 Sep 09:37, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:55 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> > Add two new power management intrinsics, and provide an implementation
> > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > are implemented as raw byte opcodes because there is not yet widespread
> > compiler support for these instructions.
> > 
> > The power management instructions provide an architecture-specific
> > function to either wait until a specified TSC timestamp is reached, or
> > optionally wait until either a TSC timestamp is reached or a memory
> > location is written to. The monitor function also provides an optional
> > comparison, to avoid sleeping when the expected write has already
> > happened, and no more writes are expected.
> > 
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> This looks like a useful feature but needs more documentation and example.
> It would make sense to put an example in l3fwd-power. 
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 16:36       ` Stephen Hemminger
@ 2020-09-14 20:52         ` Liang, Ma
  0 siblings, 0 replies; 56+ messages in thread
From: Liang, Ma @ 2020-09-14 20:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

<snip>
Hi Stephen, 
    v4 will support one port with multiple cores (still one queue per core).
    This part of the description will be updated to reflect the design change.
Regards
Liang
> The common way to express is this is:
> 
> This API is not thread-safe and not preempt-safe.
> There is also no mechanism for a single thread to wait on multiple queues.
> 
> > 
> > This design leverage RX Callback mechnaism which allow three
> > different power management methodology co exist.
> 
> nit coexist is one word
> 
> > 
> > 1. umwait/umonitor:
> > 
> >    The TSC timestamp is automatically calculated using current
> >    link speed and RX descriptor ring size, such that the sleep
> >    time is not longer than it would take for a NIC to fill its
> >    entire RX descriptor ring.
> > 
> > 2. Pause instruction
> > 
> >    Instead of move the core into deeper C state, this lightweight
> >    method use Pause instruction to releaf the processor from
> >    busy polling.
> 
> Wording here is a problem, and "releaf" should be "relief"?
> Rewording into active voice grammar would be easier.
> 
>      Use Pause instruction to allow processor to go into deeper C
>      state when busy polling.
> 
> 
> 
> > 
> > 3. Frequency Scaling
> >    Reuse exist rte power library to scale up/down core frequency
> >    depend on traffic volume.
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-04 18:33       ` Ananyev, Konstantin
@ 2020-09-14 21:01         ` Liang, Ma
  2020-09-16 14:53           ` Ananyev, Konstantin
  0 siblings, 1 reply; 56+ messages in thread
From: Liang, Ma @ 2020-09-14 21:01 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Hunt, David, Burakov, Anatoly

On 04 Sep 11:33, Ananyev, Konstantin wrote:

<snip>
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +dev->empty_poll_stats[qidx].num++;
>
Hi Konstantin,
   Agree, v4 will move the metadata to a separate structure,
   without touching the rte_eth_dev structure.
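
   A rough sketch of that direction, not the actual v4 code: the per-queue
   state lives in a structure owned by the power library and reaches the
   callback through its last parameter. The names pmd_queue_pwr_state and
   pwr_rx_cb are invented for the sketch, struct rte_mbuf is replaced by
   void so that it compiles on its own, and the per-queue threshold is an
   assumption rather than something v3 has.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state owned by the power library, not by ethdev. */
struct pmd_queue_pwr_state {
	uint64_t empty_polls;    /* consecutive polls that returned no packets */
	uint64_t empty_poll_max; /* per-queue threshold, not a global macro */
	uint64_t sleeps;         /* how many times we decided to sleep */
};

/*
 * Same shape as an ethdev RX callback. All state arrives via user_param,
 * so the callback never needs to look at rte_eth_devices[].
 */
static uint16_t
pwr_rx_cb(uint16_t port_id, uint16_t qidx, void **pkts, uint16_t nb_rx,
		uint16_t max_pkts, void *user_param)
{
	struct pmd_queue_pwr_state *st = user_param;

	(void)port_id;
	(void)qidx;
	(void)pkts;
	(void)max_pkts;

	if (nb_rx != 0) {
		/* traffic seen: start counting from scratch */
		st->empty_polls = 0;
		return nb_rx;
	}

	if (++st->empty_polls > st->empty_poll_max) {
		/* real code would call rte_power_monitor()/rte_pause() here */
		st->sleeps++;
	}

	return nb_rx;
}
```

   The pointer to one such structure per queue would be what gets passed
   as the user parameter when the callback is registered.
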
> I believe there are two fundamental issues with that approach:
> 1. You put metadata specific lib (power) callbacks into rte_eth_dev struct.
> 2. These callbacks do access rte_eth_devices[] directly.
> That doesn't look right to me - rte_eth_dev structure supposed to be treated
> as internal one librt_ether and underlying drivers and should be accessed directly
> by outer code.
> If these callbacks need some extra metadata, then it is responsibility
> of power library to allocate/manage these metadata.
> You can pass pointer to this metadata via last parameter for rte_eth_add_rx_callback().
> 
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +volatile void *target_addr;
> > +uint64_t expected, mask;
> > +uint16_t ret;
> > +
> > +/*
> > + * get address of next descriptor in the RX
> > + * ring for this queue, as well as expected
> > + * value and a mask.
> > + */
> > +ret = (*dev->dev_ops->next_rx_desc)
> > +(dev->data->rx_queues[qidx],
> > + &target_addr, &expected, &mask);
> > +if (ret == 0)
> > +/* -1ULL is maximum value for TSC */
> > +rte_power_monitor(target_addr,
> > +  expected, mask,
> > +  0, -1ULL);
> > +}
> > +} else
> > +dev->empty_poll_stats[qidx].num = 0;
> > +
> > +return nb_rx;
> > +}
> > +
> > +static uint16_t
> > +rte_power_mgmt_pause(uint16_t port_id, uint16_t qidx,
> > +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> > +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> > +{
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +int i;
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +
> > +dev->empty_poll_stats[qidx].num++;
> > +
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +
> > +for (i = 0; i < RTE_ETH_PAUSE_NUM; i++)
> > +rte_pause();
> > +
> > +}
> > +} else
> > +dev->empty_poll_stats[qidx].num = 0;
> > +
> > +return nb_rx;
> > +}
> > +
> > +static uint16_t
> > +rte_power_mgmt_scalefreq(uint16_t port_id, uint16_t qidx,
> > +struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> > +uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> > +{
> > +struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +if (unlikely(nb_rx == 0)) {
> > +dev->empty_poll_stats[qidx].num++;
> > +if (unlikely(dev->empty_poll_stats[qidx].num >
> > +     ETH_EMPTYPOLL_MAX)) {
> > +
> > +/*scale down freq */
> > +rte_power_freq_min(rte_lcore_id());
> > +
> > +}
> > +} else {
> > +dev->empty_poll_stats[qidx].num = 0;
> > +/* scal up freq */
> > +rte_power_freq_max(rte_lcore_id());
> > +}
> > +
> > +return nb_rx;
> > +}
> > +
> > +int
> > +rte_power_pmd_mgmt_enable(unsigned int lcore_id,
> > +uint16_t port_id,
> > +enum rte_eth_dev_power_mgmt_cb_mode mode)
> > +{
> > +struct rte_eth_dev *dev;
> > +
> > +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> > +dev = &rte_eth_devices[port_id];
> > +
> > +if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_ENABLED)
> > +return -EINVAL;
> > +/* allocate memory for empty poll stats */
> > +dev->empty_poll_stats = rte_malloc_socket(NULL,
> > +  sizeof(struct rte_eth_ep_stat)
> > +  * RTE_MAX_QUEUES_PER_PORT,
> > +  0, dev->data->numa_node);
> > +if (dev->empty_poll_stats == NULL)
> > +return -ENOMEM;
> > +
> > +switch (mode) {
> > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > +if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG))
> > +return -ENOTSUP;
> 
> Here and in other places: in case of error return you don't' free your empty_poll_stats.
> 
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> 
> Why zero for queue number, why not to pass queue_id as a parameter for that function?
v4 will use queue_id instead of 0. v3 still assumes that only queue 0 is used.
> 
> > +rte_power_mgmt_umwait, NULL);
> 
> As I said above, instead of NULL - could be pointer to metadata struct.
v4 will address this. 
> 
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > +/* init scale freq */
> > +if (rte_power_init(lcore_id))
> > +return -EINVAL;
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> > +rte_power_mgmt_scalefreq, NULL);
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > +dev->cur_pwr_cb = rte_eth_add_rx_callback(port_id, 0,
> > +rte_power_mgmt_pause, NULL);
> > +break;
> > +}
> > +
> > +dev->cb_mode = mode;
> > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_ENABLED;
> > +return 0;
> > +}
> > +
> > +int
> > +rte_power_pmd_mgmt_disable(unsigned int lcore_id,
> > +uint16_t port_id)
> > +{
> > +struct rte_eth_dev *dev;
> > +
> > +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> > +dev = &rte_eth_devices[port_id];
> > +
> > +/*add flag check */
> > +
> > +if (dev->pwr_mgmt_state == RTE_ETH_DEV_POWER_MGMT_DISABLED)
> > +return -EINVAL;
> > +
> > +/* rte_free ignores NULL so safe to call without checks */
> > +rte_free(dev->empty_poll_stats);
> 
> You can't free callback metadata before removing the callback itself.
> In fact, with current rx callback code it is not safe to free it
> even after (we discussed it offline).
agree. 
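
A stand-alone sketch of the two points raised in the comments above, the
leak on the error path and the teardown ordering. It is illustrative
only: struct pwr_ctx, pwr_enable(), pwr_disable() and the fake_*()
helpers are invented, with fake_add_cb()/fake_remove_cb() standing in for
rte_eth_add_rx_callback()/rte_eth_remove_rx_callback(). It does not solve
the problem that a data-plane thread may still be inside the callback
after removal; that step is only marked with a comment.

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-ins so the sketch compiles on its own; these are not DPDK calls. */
struct pwr_ctx {
	void *stats; /* callback metadata */
	void *cb;    /* handle returned when the callback was added */
};

static int fake_cb_registered;

static void *
fake_add_cb(void)
{
	fake_cb_registered = 1;
	return &fake_cb_registered;
}

static void
fake_remove_cb(void *cb)
{
	(void)cb;
	fake_cb_registered = 0;
}

static int
pwr_enable(struct pwr_ctx *ctx, int mode_supported)
{
	int ret;

	ctx->stats = calloc(1, 64);
	if (ctx->stats == NULL)
		return -ENOMEM;

	if (!mode_supported) {
		ret = -ENOTSUP;
		goto err_free;
	}

	ctx->cb = fake_add_cb();
	if (ctx->cb == NULL) {
		ret = -EINVAL;
		goto err_free;
	}
	return 0;

err_free:
	/* every failure after the allocation ends up here */
	free(ctx->stats);
	ctx->stats = NULL;
	return ret;
}

static void
pwr_disable(struct pwr_ctx *ctx)
{
	/* 1. unhook the callback so that no new invocations start */
	fake_remove_cb(ctx->cb);
	ctx->cb = NULL;
	/* 2. real code must wait until no thread can still be running the
	 *    callback before going further; omitted in this sketch */
	/* 3. only then release the metadata */
	free(ctx->stats);
	ctx->stats = NULL;
}
```
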
> 
> > +
> > +switch (dev->cb_mode) {
> > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > +rte_eth_remove_rx_callback(port_id, 0,
> > +   dev->cur_pwr_cb);
> > +break;
> > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > +rte_power_freq_max(lcore_id);
> 
> Stupid q: what makes you think that lcore frequency was max,
> *before* you setup the callback?
That is because rte_power_init() has already determined the system max;
the init code already invokes rte_power_init().
> 
> > +rte_eth_remove_rx_callback(port_id, 0,
> > +   dev->cur_pwr_cb);
> > +if (rte_power_exit(lcore_id))
> > +return -EINVAL;
> > +break;
> > +}
> > +
> > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > +dev->cur_pwr_cb = NULL;
> > +dev->cb_mode = 0;
> > +
> > +return 0;
> > +}
> > diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> > index 00ee5753e2..ade83cfd4f 100644
> > --- a/lib/librte_power/rte_power_version.map
> > +++ b/lib/librte_power/rte_power_version.map
> > @@ -34,4 +34,8 @@ EXPERIMENTAL {
> >  rte_power_guest_channel_receive_msg;
> >  rte_power_poll_stat_fetch;
> >  rte_power_poll_stat_update;
> > +# added in 20.08
> > +rte_power_pmd_mgmt_disable;
> > +rte_power_pmd_mgmt_enable;
> > +
> >  };
> > --
> > 2.17.1
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API
  2020-09-04 16:37       ` Stephen Hemminger
@ 2020-09-14 21:04         ` Liang, Ma
  0 siblings, 0 replies; 56+ messages in thread
From: Liang, Ma @ 2020-09-14 21:04 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

Agree, this will be addressed.
On 04 Sep 09:37, Stephen Hemminger wrote:
> On Fri,  4 Sep 2020 11:18:56 +0100
> Liang Ma <liang.j.ma@intel.com> wrote:
> 
> 
> 
> > +#define ETH_EMPTYPOLL_MAX          512 /**< Empty poll number threshlold */
> 
> Spelling here.
> 
> Also, shouldn't this be a per-device (or per-queue) configuration value.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 18:42     ` Stephen Hemminger
@ 2020-09-14 21:12       ` Liang, Ma
  2020-09-16 16:34       ` Liang, Ma
  1 sibling, 0 replies; 56+ messages in thread
From: Liang, Ma @ 2020-09-14 21:12 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

On 04 Sep 11:42, Stephen Hemminger wrote:
<snip>
We are very open to discussing the design with other vendors.
> Before this is merged, please work with Arm maintainers to have a version that
> works on Arm 64 as well. Don't think this should be merged unless the two major
> platforms supported by DPDK can work with it. 

> Also, not sure if this mechanism can work with other drivers. You need to
> work with other vendors to show that the same infrastructure can work with
> their hardware. Once again, I don't think this can go in if it only can
> work on Intel.  It needs to work on Broadcom, Mellanox to be useful.
This mechanism should work with any device that uses a HW ring descriptor mechanism.
I think most Mellanox and Broadcom NICs can support it easily.

> Will it work in a VM? Will it work with virtio or vmxnet3?
> 
Generally speaking, it is not very easy for a guest OS to use this.
However, virtio is under investigation.
> Having a single vendor solution is a non-starter for me.
> They don't all have to be there to get it merged, but if the design only
> works on single platform then it is not helpful.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-14 21:01         ` Liang, Ma
@ 2020-09-16 14:53           ` Ananyev, Konstantin
  2020-09-16 16:39             ` Liang, Ma
  0 siblings, 1 reply; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-09-16 14:53 UTC (permalink / raw)
  To: Ma, Liang J; +Cc: dev, Hunt, David, Burakov, Anatoly



> >
> > > +
> > > +switch (dev->cb_mode) {
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_WAIT:
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_PAUSE:
> > > +rte_eth_remove_rx_callback(port_id, 0,
> > > +   dev->cur_pwr_cb);
> > > +break;
> > > +case RTE_ETH_DEV_POWER_MGMT_CB_SCALE:
> > > +rte_power_freq_max(lcore_id);
> >
> > Stupid q: what makes you think that lcore frequency was max,
> > *before* you setup the callback?
> that is because the rte_power_init() has figured out the system max.
> the init code invocate rte_power_init() already.

So rte_power_init(lcore) always raises lcore frequency to
max possible value?

> >
> > > +rte_eth_remove_rx_callback(port_id, 0,
> > > +   dev->cur_pwr_cb);
> > > +if (rte_power_exit(lcore_id))
> > > +return -EINVAL;
> > > +break;
> > > +}
> > > +
> > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > +dev->cur_pwr_cb = NULL;
> > > +dev->cb_mode = 0;
> > > +
> > > +return 0;
> > > +}

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 18:42     ` Stephen Hemminger
  2020-09-14 21:12       ` Liang, Ma
@ 2020-09-16 16:34       ` Liang, Ma
  1 sibling, 0 replies; 56+ messages in thread
From: Liang, Ma @ 2020-09-16 16:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, david.hunt, anatoly.burakov

On 04 Sep 11:42, Stephen Hemminger wrote:
<snip> 

We have discussed this with Arm developers in the past.
Please refer to https://patches.dpdk.org/patch/70662/
There was no objection, in my opinion.
Also, the API we proposed has the experimental tag, so other vendors can still change it.

The ethdev internal op we introduced should work with any NIC that uses a ring descriptor
writeback mechanism, but we lack insight into the internals of Mellanox or Broadcom NICs.

The AF_XDP PMD and virtio-net are under investigation.

I hope the above explanation addresses your concerns.

> Before this is merged, please work with Arm maintainers to have a version that
> works on Arm 64 as well. Don't think this should be merged unless the two major
> platforms supported by DPDK can work with it.
> 
> Also, not sure if this mechanism can work with other drivers. You need to
> work with other vendors to show that the same infrastructure can work with
> their hardware. Once again, I don't think this can go in if it only can
> work on Intel.  It needs to work on Broadcom, Mellanox to be useful.
> 
> Will it work in a VM? Will it work with virtio or vmxnet3?
> 
> Having a single vendor solution is a non-starter for me.
> They don't all have to be there to get it merged, but if the design only
> works on single platform then it is not helpful.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-16 14:53           ` Ananyev, Konstantin
@ 2020-09-16 16:39             ` Liang, Ma
  2020-09-16 16:44               ` Ananyev, Konstantin
  0 siblings, 1 reply; 56+ messages in thread
From: Liang, Ma @ 2020-09-16 16:39 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Hunt, David, Burakov, Anatoly

On 16 Sep 07:53, Ananyev, Konstantin wrote:
<snip>
Yes, we only have two gears: min or max. However, users can still customize
their system max with the power mgmt Python script on Intel platforms.
> So rte_power_init(lcore) always raises lcore frequency to
> max possible value?
> 
> > >
> > > > +rte_eth_remove_rx_callback(port_id, 0,
> > > > +   dev->cur_pwr_cb);
> > > > +if (rte_power_exit(lcore_id))
> > > > +return -EINVAL;
> > > > +break;
> > > > +}
> > > > +
> > > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > > +dev->cur_pwr_cb = NULL;
> > > > +dev->cb_mode = 0;
> > > > +
> > > > +return 0;
> > > > +}

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback
  2020-09-16 16:39             ` Liang, Ma
@ 2020-09-16 16:44               ` Ananyev, Konstantin
  0 siblings, 0 replies; 56+ messages in thread
From: Ananyev, Konstantin @ 2020-09-16 16:44 UTC (permalink / raw)
  To: Ma, Liang J; +Cc: dev, Hunt, David, Burakov, Anatoly

> On 16 Sep 07:53, Ananyev, Konstantin wrote:
> <snip>
> Yes. we only has two gear. min or max. However, user still can customize
> their system max with power mgmt python script on Intel platform.

Ok, thanks for explanation.

> > So rte_power_init(lcore) always raises lcore frequency to
> > max possible value?
> >
> > > >
> > > > > +rte_eth_remove_rx_callback(port_id, 0,
> > > > > +   dev->cur_pwr_cb);
> > > > > +if (rte_power_exit(lcore_id))
> > > > > +return -EINVAL;
> > > > > +break;
> > > > > +}
> > > > > +
> > > > > +dev->pwr_mgmt_state = RTE_ETH_DEV_POWER_MGMT_DISABLED;
> > > > > +dev->cur_pwr_cb = NULL;
> > > > > +dev->cb_mode = 0;
> > > > > +
> > > > > +return 0;
> > > > > +}

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics
  2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
                       ` (8 preceding siblings ...)
  2020-09-06 21:44     ` Ananyev, Konstantin
@ 2020-09-18  5:01     ` Jerin Jacob
  9 siblings, 0 replies; 56+ messages in thread
From: Jerin Jacob @ 2020-09-18  5:01 UTC (permalink / raw)
  To: Liang Ma, Honnappa Nagarahalli, Stephen Hemminger
  Cc: dpdk-dev, David Hunt, Anatoly Burakov, Richardson, Bruce,
	Ananyev, Konstantin, Thomas Monjalon

On Fri, Sep 4, 2020 at 3:49 PM Liang Ma <liang.j.ma@intel.com> wrote:
>
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
>
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>


> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
[snip]
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +               const uint64_t expected_value, const uint64_t value_mask,
> +               const uint32_t state, const uint64_t tsc_timestamp)
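
For reference, the expected/mask check described in the comment above can be
sketched portably as follows. This is only an illustration of the documented
semantics, not the code from the patch: the `sketch_*` names, the result codes
and the fake TSC counter are all assumptions made for the example, the `state`
argument is left out, and a polling loop stands in for UMONITOR/UMWAIT (so it
saves no power at all).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; the patch itself only returns an int. */
enum sketch_wake_reason {
	SKETCH_SLEEP_SKIPPED, /* masked value already matched `expected_value` */
	SKETCH_WRITE_SEEN,    /* monitored location changed while waiting */
	SKETCH_TIMEOUT        /* deadline was reached first */
};

/* Hypothetical stand-in for the TSC: a counter that advances on every read. */
static uint64_t sketch_tsc;

static uint64_t
sketch_read_tsc(void)
{
	return sketch_tsc++;
}

static enum sketch_wake_reason
sketch_power_monitor(const volatile uint64_t *p, uint64_t expected_value,
		uint64_t value_mask, uint64_t tsc_timestamp)
{
	uint64_t initial;

	assert(p != NULL);
	initial = *p;

	/* A zero mask disables the check, as the doc comment describes. */
	if (value_mask != 0 && (initial & value_mask) == expected_value)
		return SKETCH_SLEEP_SKIPPED;

	/* Polling replaces UMONITOR/UMWAIT, which would idle the core here. */
	while (sketch_read_tsc() < tsc_timestamp) {
		if (*p != initial)
			return SKETCH_WRITE_SEEN;
	}
	return SKETCH_TIMEOUT;
}
```

With a descriptor word of 0x1, a call such as
sketch_power_monitor(&desc, 0x1, 0x1, deadline) returns immediately because the
masked value already matches, while passing a mask of 0 always goes on to wait.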

IMO, we must introduce some arch feature-capability _get_ scheme to tell
the consumer that this API is only supported on x86, probably as a function[1]
or a macro-flag scheme, and provide a stub for the other architectures, since
the API is marked as generic, i.e. rte_power_* rather than rte_x86_*.

This will help the consumer create workers based on instruction features
that can NOT be abstracted as a generic feature across architectures.


[1]
struct rte_arch_inst_feat {
        uint32_t power_monitor      : 1;  /**< Power monitor */
...
};

void rte_arch_inst_feat_get(struct rte_arch_inst_feat *feat);
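
To make the idea in [1] concrete, here is a minimal sketch of how a consumer
might use such a query to pick a worker. The struct and function names come
from the proposal above and are not an existing DPDK API; the function body,
the SKETCH_HAS_POWER_MONITOR build flag and the `sketch_*` workers are
assumptions made purely for illustration. A real x86 implementation would
presumably check the CPUID WAITPKG flag instead of a build-time constant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical build-time switch; defaults to "not supported" (the stub). */
#ifndef SKETCH_HAS_POWER_MONITOR
#define SKETCH_HAS_POWER_MONITOR 0
#endif

/* Only the power_monitor flag is named in the proposal. */
struct rte_arch_inst_feat {
	uint32_t power_monitor : 1; /**< Power monitor */
};

/* Assumed behaviour: clear every flag, then set what the arch supports. */
static void
rte_arch_inst_feat_get(struct rte_arch_inst_feat *feat)
{
	assert(feat != NULL);
	memset(feat, 0, sizeof(*feat));
	feat->power_monitor = SKETCH_HAS_POWER_MONITOR;
}

typedef const char *(*sketch_worker_t)(void);

/* Worker that would sleep on the RX descriptor via rte_power_monitor(). */
static const char *
sketch_worker_monitor(void)
{
	return "monitor";
}

/* Fallback worker that keeps busy-polling. */
static const char *
sketch_worker_poll(void)
{
	return "poll";
}

/* The consumer picks a worker from the reported capabilities. */
static sketch_worker_t
sketch_select_worker(const struct rte_arch_inst_feat *feat)
{
	assert(feat != NULL);
	return feat->power_monitor ? sketch_worker_monitor : sketch_worker_poll;
}
```

The application would call rte_arch_inst_feat_get() once at startup and launch
whichever worker sketch_select_worker() returns, so architectures without the
instruction simply fall back to the polling worker.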



Thread overview: 56+ messages
2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
2020-05-28 11:39   ` Ananyev, Konstantin
2020-05-28 14:40     ` Burakov, Anatoly
2020-05-28 14:58       ` Bruce Richardson
2020-05-28 15:38       ` Ananyev, Konstantin
2020-05-29  6:56         ` Jerin Jacob
2020-06-02 10:15           ` Ananyev, Konstantin
2020-06-03  6:22           ` Honnappa Nagarahalli
2020-06-03  6:31             ` Jerin Jacob
2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
2020-05-28 12:15   ` Ananyev, Konstantin
2020-05-27 17:02 ` [dpdk-dev] [RFC 3/6] net/ixgbe: implement " Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 4/6] net/i40e: " Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 5/6] net/ice: " Anatoly Burakov
2020-05-27 17:02 ` [dpdk-dev] [RFC 6/6] app/testpmd: add command for power management on a port Anatoly Burakov
2020-05-27 17:33 ` [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Jerin Jacob
2020-05-27 20:57   ` Stephen Hemminger
2020-08-11 10:27 ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang Ma
2020-08-11 10:27   ` [dpdk-dev] [RFC v2 2/5] ethdev: add simple power management API and callback Liang Ma
2020-08-13 18:11     ` Liang, Ma
2020-08-11 10:27   ` [dpdk-dev] [RFC v2 3/5] net/ixgbe: implement power management API Liang Ma
2020-08-11 10:27   ` [dpdk-dev] [RFC v2 4/5] net/i40e: " Liang Ma
2020-08-11 10:27   ` [dpdk-dev] [RFC v2 5/5] net/ice: " Liang Ma
2020-08-13 18:04   ` [dpdk-dev] [RFC v2 1/5] eal: add power management intrinsics Liang, Ma
2020-09-03 16:06   ` [dpdk-dev] [RFC PATCH v3 1/6] " Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 2/6] ethdev: add simple power management API Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 3/6] power: add simple power management API and callback Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 5/6] net/i40e: " Liang Ma
2020-09-03 16:07     ` [dpdk-dev] [RFC PATCH v3 6/6] net/ice: " Liang Ma
2020-09-04 10:18   ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Liang Ma
2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 2/6] ethdev: add simple power management API Liang Ma
2020-09-04 16:37       ` Stephen Hemminger
2020-09-14 21:04         ` Liang, Ma
2020-09-04 20:54       ` Ananyev, Konstantin
2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 3/6] power: add simple power management API and callback Liang Ma
2020-09-04 16:36       ` Stephen Hemminger
2020-09-14 20:52         ` Liang, Ma
2020-09-04 18:33       ` Ananyev, Konstantin
2020-09-14 21:01         ` Liang, Ma
2020-09-16 14:53           ` Ananyev, Konstantin
2020-09-16 16:39             ` Liang, Ma
2020-09-16 16:44               ` Ananyev, Konstantin
2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 4/6] net/ixgbe: implement power management API Liang Ma
2020-09-04 10:18     ` [dpdk-dev] [PATCH v3 5/6] net/i40e: " Liang Ma
2020-09-04 10:19     ` [dpdk-dev] [PATCH v3 6/6] net/ice: " Liang Ma
2020-09-04 16:23     ` [dpdk-dev] [PATCH v3 1/6] eal: add power management intrinsics Stephen Hemminger
2020-09-14 20:48       ` Liang, Ma
2020-09-04 16:37     ` Stephen Hemminger
2020-09-14 20:49       ` Liang, Ma
2020-09-04 18:42     ` Stephen Hemminger
2020-09-14 21:12       ` Liang, Ma
2020-09-16 16:34       ` Liang, Ma
2020-09-06 21:44     ` Ananyev, Konstantin
2020-09-18  5:01     ` Jerin Jacob
