* [dpdk-dev] [PATCH v4 1/7] power_intrinsics: use callbacks for comparison
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Previously, the semantics of power monitor were such that the current
value was checked against an expected value and, if they matched, the
sleep was aborted. This was somewhat inflexible, because it only allowed
checking for a specific value in a specific way.
This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define its
own comparison semantics and decide how to detect the need to abort
entering the power-optimized state.
Existing implementations are adjusted to follow the new semantics.
Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
Notes:
v4:
- Return error if callback is set to NULL
- Replace raw number with a macro in monitor condition opaque data
v2:
- Use callback mechanism for more flexibility
- Address feedback from Konstantin
doc/guides/rel_notes/release_21_08.rst | 1 +
drivers/event/dlb2/dlb2.c | 17 ++++++++--
drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
drivers/net/ice/ice_rxtx.c | 20 +++++++----
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
.../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
9 files changed, 121 insertions(+), 44 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
Also, make sure to start the actual text at the margin.
=======================================================
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
ABI Changes
-----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
}
}
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ /* abort if the value matches */
+ return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
static inline int
dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
expected_value = 0;
pmc.addr = monitor_addr;
- pmc.val = expected_value;
- pmc.mask = qe_mask.raw_qe[1];
+ /* store expected value and comparison mask in opaque data */
+ pmc.opaque[CLB_VAL_IDX] = expected_value;
+ pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+ /* set up callback */
+ pmc.fn = dlb2_monitor_callback;
pmc.size = sizeof(uint64_t);
rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..081682f88b 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
#define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
+static int
+i40e_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = i40e_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index 0361af0d85..7ed196ec22 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
}
+static int
+iavf_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = iavf_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a3e7..d12437d19d 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
+static int
+ice_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.status_error0;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
- pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /* comparison callback */
+ pmc->fn = ice_monitor_callback;
/* register is 16-bit */
pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
};
+static int
+ixgbe_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.upper.status_error;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
- pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /* comparison callback */
+ pmc->fn = ixgbe_monitor_callback;
/* the registers are 32-bit */
pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..17370b77dc 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return rx_queue_count(rxq);
}
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t m = opaque[CLB_MSK_IDX];
+ const uint64_t v = opaque[CLB_VAL_IDX];
+
+ return (value & m) == v ? -1 : 0;
+}
+
int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -rte_errno;
}
pmc->addr = &cqe->op_own;
- pmc->val = !!idx;
- pmc->mask = MLX5_CQE_OWNER_MASK;
+ pmc->opaque[CLB_VAL_IDX] = !!idx;
+ pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+ pmc->fn = mlx_monitor_callback;
pmc->size = sizeof(uint8_t);
return 0;
}
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
* which are architecture-dependent.
*/
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ * The value read from memory.
+ * @param opaque
+ * Callback-specific data.
+ *
+ * @return
+ * 0 if entering of power optimized state should proceed
+ * -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
struct rte_power_monitor_cond {
volatile void *addr; /**< Address to monitor for changes */
- uint64_t val; /**< If the `mask` is non-zero, location pointed
- * to by `addr` will be read and compared
- * against this value.
- */
- uint64_t mask; /**< 64-bit mask to extract value read from `addr` */
- uint8_t size; /**< Data size (in bytes) that will be used to compare
- * expected value (`val`) with data read from the
+ uint8_t size; /**< Data size (in bytes) that will be read from the
* monitored memory location (`addr`). Can be 1, 2,
* 4, or 8. Supplying any other value will result in
* an error.
*/
+ rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+ * entering power optimized state should
+ * be aborted.
+ */
+ uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+ /**< Callback-specific data */
};
/**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
const unsigned int lcore_id = rte_lcore_id();
struct power_wait_status *s;
+ uint64_t cur_value;
/* prevent user from running this instruction if it's not supported */
if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
if (__check_val_size(pmc->size) < 0)
return -EINVAL;
+ if (pmc->fn == NULL)
+ return -EINVAL;
+
s = &wait_status[lcore_id];
/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
/* now that we've put this address into monitor, we can unlock */
rte_spinlock_unlock(&s->lock);
- /* if we have a comparison mask, we might not need to sleep at all */
- if (pmc->mask) {
- const uint64_t cur_value = __get_umwait_val(
- pmc->addr, pmc->size);
- const uint64_t masked = cur_value & pmc->mask;
+ cur_value = __get_umwait_val(pmc->addr, pmc->size);
- /* if the masked value is already matching, abort */
- if (masked == pmc->val)
- goto end;
- }
+ /* check if callback indicates we should abort */
+ if (pmc->fn(cur_value, pmc->opaque) != 0)
+ goto end;
/* execute UMWAIT */
asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
--
2.25.1
* [dpdk-dev] [PATCH v4 2/7] net/af_xdp: add power monitor support
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev
Implement support for .get_monitor_addr in the AF_XDP driver.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Rewrite using the callback mechanism
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..7830d0c23a 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
#include <rte_malloc.h>
#include <rte_ring.h>
#include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
#include "compat.h"
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t v = opaque[CLB_VAL_IDX];
+ const uint64_t m = (uint32_t)~0;
+
+ /* if the value has changed, abort entering power optimized state */
+ return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+ struct pkt_rx_queue *rxq = rx_queue;
+ unsigned int *prod = rxq->rx.producer;
+ const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+ /* watch for changes in producer ring */
+ pmc->addr = (void*)prod;
+
+ /* store current value */
+ pmc->opaque[CLB_VAL_IDX] = cur_val;
+ pmc->fn = eth_monitor_callback;
+
+ /* AF_XDP producer ring index is 32-bit */
+ pmc->size = sizeof(uint32_t);
+
+ return 0;
+}
+
static int
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
{
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.stats_get = eth_stats_get,
.stats_reset = eth_stats_reset,
+ .get_monitor_addr = eth_get_monitor_addr
};
/** parse busy_budget argument */
--
2.25.1
* [dpdk-dev] [PATCH v4 3/7] eal: add power monitor for multiple events
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
To: dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state entered via
the TPAUSE instruction is exited on an RTM transaction abort, so if we
add the addresses we're interested in to the transaction's read-set, any
write to those addresses will wake us up.
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v4:
- Fixed bugs in accessing the monitor condition
- Abort on any monitor condition not having a defined callback
v2:
- Adapt to callback mechanism
doc/guides/rel_notes/release_21_08.rst | 2 +
lib/eal/arm/rte_power_intrinsics.c | 11 +++
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 35 +++++++++
lib/eal/ppc/rte_power_intrinsics.c | 11 +++
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++
8 files changed, 139 insertions(+)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+
Removed Items
-------------
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
/**< indicates support for rte_power_monitor function */
uint32_t power_pause : 1;
/**< indicates support for rte_power_pause function */
+ uint32_t power_monitor_multi : 1;
+ /**< indicates support for rte_power_monitor_multi function */
};
/**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
__rte_experimental
int rte_power_pause(const uint64_t tsc_timestamp);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, a callback function (`fn`) and callback-specific data
+ * (`opaque`) are provided for each condition. Each callback is invoked
+ * with the value read from its monitored address, and if it indicates
+ * so, the entering of optimized power state may be aborted.
+ *
+ * @warning It is the responsibility of the user to check if this function is
+ * supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ * Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ * An array of monitoring condition structures.
+ * @param num
+ * Length of the `pmc` array.
+ * @param tsc_timestamp
+ * Maximum TSC timestamp to wait for. Note that the wait behavior is
+ * architecture-dependent.
+ *
+ * @return
+ * 0 on success
+ * -EINVAL on invalid parameters
+ * -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp);
+
#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
rte_version_release; # WINDOWS_NO_EXPORT
rte_version_suffix; # WINDOWS_NO_EXPORT
rte_version_year; # WINDOWS_NO_EXPORT
+
+ # added in 21.08
+ rte_power_monitor_multi; # WINDOWS_NO_EXPORT
};
INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
intrinsics->power_monitor = 1;
intrinsics->power_pause = 1;
+ if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+ intrinsics->power_monitor_multi = 1;
}
}
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
#include <rte_common.h>
#include <rte_lcore.h>
+#include <rte_rtm.h>
#include <rte_spinlock.h>
#include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
}
static bool wait_supported;
+static bool wait_multi_supported;
static inline uint64_t
__get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
if (i.power_monitor && i.power_pause)
wait_supported = 1;
+ if (i.power_monitor_multi)
+ wait_multi_supported = 1;
}
int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
* In this case, since we've already woken up, the "wakeup" was
* unneeded, and since T1 is still waiting on T2 releasing the lock, the
* wakeup address is still valid so it's perfectly safe to write it.
+ *
+ * For the multi-monitor case, the act of locking will in itself trigger
+ * the wakeup, so no additional writes are necessary.
*/
rte_spinlock_lock(&s->lock);
if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return 0;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ const unsigned int lcore_id = rte_lcore_id();
+ struct power_wait_status *s = &wait_status[lcore_id];
+ uint32_t i, rc;
+
+ /* check if supported */
+ if (!wait_multi_supported)
+ return -ENOTSUP;
+
+ if (pmc == NULL || num == 0)
+ return -EINVAL;
+
+ /* we are already inside transaction region, return */
+ if (rte_xtest() != 0)
+ return 0;
+
+ /* start new transaction region */
+ rc = rte_xbegin();
+
+ /* transaction abort, possible write to one of wait addresses */
+ if (rc != RTE_XBEGIN_STARTED)
+ return 0;
+
+ /*
+ * the mere act of reading the lock status here adds the lock to
+ * the read set. This means that when we trigger a wakeup from another
+ * thread, even if we don't have a defined wakeup address and thus don't
+ * actually cause any writes, the act of locking our lock will itself
+ * trigger the wakeup and abort the transaction.
+ */
+ rte_spinlock_is_locked(&s->lock);
+
+ /*
+ * add all addresses to wait on into transaction read-set and check if
+ * any of wakeup conditions are already met.
+ */
+ rc = 0;
+ for (i = 0; i < num; i++) {
+ const struct rte_power_monitor_cond *c = &pmc[i];
+
+ /* cannot be NULL */
+ if (c->fn == NULL) {
+ rc = -EINVAL;
+ break;
+ }
+
+ const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+ /* abort if callback indicates that we need to stop */
+ if (c->fn(val, c->opaque) != 0)
+ break;
+ }
+
+ /* none of the conditions were met, sleep until timeout */
+ if (i == num)
+ rte_power_pause(tsc_timestamp);
+
+ /* end transaction region */
+ rte_xend();
+
+ return rc;
+}
--
2.25.1
* [dpdk-dev] [PATCH v4 4/7] power: remove thread safety from PMD power API's
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.
We could have used something like an RCU for this use case, but absent
a pressing need for thread safety we'll go the easy way and just
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Add check for stopped queue
- Clarified doc message
- Added release notes
doc/guides/rel_notes/release_21_08.rst | 5 +
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 133 ++++++++++---------------
lib/power/rte_power_pmd_mgmt.h | 6 ++
4 files changed, 67 insertions(+), 80 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+* rte_power: The experimental PMD power management API is no longer considered
+ to be thread safe; all Rx queues affected by the API will now need to be
+ stopped before making any changes to the power management scheme.
+
+
ABI Changes
-----------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
'rte_power_pmd_mgmt.h',
'rte_power_guest_channel.h',
)
+if cc.has_argument('-Wno-cast-qual')
+ cflags += '-Wno-cast-qual'
+endif
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
/**< Callback mode for this queue */
const struct rte_eth_rxtx_callback *cur_cb;
/**< Callback instance */
- volatile bool umwait_in_progress;
- /**< are we currently sleeping? */
uint64_t empty_poll_stats;
/**< Number of empty polls */
} __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
struct rte_power_monitor_cond pmc;
uint16_t ret;
- /*
- * we might get a cancellation request while being
- * inside the callback, in which case the wakeup
- * wouldn't work because it would've arrived too early.
- *
- * to get around this, we notify the other thread that
- * we're sleeping, so that it can spin until we're done.
- * unsolicited wakeups are perfectly safe.
- */
- q_conf->umwait_in_progress = true;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- /* check if we need to cancel sleep */
- if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
- /* use monitoring condition to sleep */
- ret = rte_eth_get_monitor_addr(port_id, qidx,
- &pmc);
- if (ret == 0)
- rte_power_monitor(&pmc, UINT64_MAX);
- }
- q_conf->umwait_in_progress = false;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+ /* use monitoring condition to sleep */
+ ret = rte_eth_get_monitor_addr(port_id, qidx,
+ &pmc);
+ if (ret == 0)
+ rte_power_monitor(&pmc, UINT64_MAX);
}
} else
q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
return nb_rx;
}
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+ struct rte_eth_rxq_info qinfo;
+
+ if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+ return -1;
+
+ return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
struct pmd_queue_cfg *queue_cfg;
struct rte_eth_dev_info info;
+ rte_rx_callback_fn clb;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
queue_cfg = &port_cfg[port_id][queue_id];
if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->umwait_in_progress = false;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* ensure we update our state before callback starts */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_umwait, NULL);
+ clb = clb_umwait;
break;
}
case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
- queue_id, clb_scale_freq, NULL);
+ clb = clb_scale_freq;
break;
}
case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (global_data.tsc_per_us == 0)
calc_tsc();
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_pause, NULL);
+ clb = clb_pause;
break;
+ default:
+ RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+ ret = -EINVAL;
+ goto end;
}
+
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, NULL);
+
ret = 0;
end:
return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
struct pmd_queue_cfg *queue_cfg;
+ int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
return -EINVAL;
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
/* no need to check queue id as wrong queue id would not be enabled */
queue_cfg = &port_cfg[port_id][queue_id];
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
/* stop any callbacks from progressing */
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
- /* ensure we update our state before continuing */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
switch (queue_cfg->cb_mode) {
- case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- bool exit = false;
- do {
- /*
- * we may request cancellation while the other thread
- * has just entered the callback but hasn't started
- * sleeping yet, so keep waking it up until we know it's
- * done sleeping.
- */
- if (queue_cfg->umwait_in_progress)
- rte_power_monitor_wakeup(lcore_id);
- else
- exit = true;
- } while (!exit);
- }
- /* fall-through */
+ case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
rte_eth_remove_rx_callback(port_id, queue_id,
queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
break;
}
/*
- * we don't free the RX callback here because it is unsafe to do so
- * unless we know for a fact that all data plane threads have stopped.
+ * the API doc mandates that the user stops all processing on affected
+ * ports before calling any of these API's, so we can assume that the
+ * callbacks can be freed. we're intentionally casting away const-ness.
*/
- queue_cfg->cur_cb = NULL;
+ rte_free((void *)queue_cfg->cur_cb);
return 0;
}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue will be polled from.
* @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue is polled from.
* @param port_id
--
2.25.1
* [dpdk-dev] [PATCH v4 5/7] power: support callbacks for multiple Rx queues
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
` (3 preceding siblings ...)
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-06-28 15:54 ` Anatoly Burakov
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 6/7] power: support monitoring " Anatoly Burakov
` (2 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
To: dev, David Hunt, Ray Kinsella, Neil Horman
Cc: konstantin.ananyev, ciara.loftus
Currently, the PMD power management support has a hard limitation of a
single queue per lcore. This is not ideal, as most DPDK use cases poll
multiple queues per core.
The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:
- Replace per-queue structures with per-lcore ones, so that any device
polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
added to the list of cores to poll, so that the callback is aware of
other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
shared between all queues polled on a particular lcore, and are only
activated when a specially designated "power saving" queue is polled. To
put it another way, we have no idea in which order the user will poll
the queues, so we rely on them telling us that queue X is the last one
in the polling loop, and that any power management should happen there.
- A new API is added to mark a specific Rx queue as "power saving".
Failing to call this API will result in no power management. However,
when there is only one queue per core, it is obvious which queue is the
"power saving" one, so use cases that previously worked without this
API will continue to work without it.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
is incapable of monitoring more than one address.
Also, while we're at it, update and improve the docs.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v3:
- Move the list of supported NICs to NIC feature table
v2:
- Use a TAILQ for queues instead of a static array
- Address feedback from Konstantin
- Add additional checks for stopped queues
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 75 +++--
doc/guides/rel_notes/release_21_08.rst | 3 +
lib/power/rte_power_pmd_mgmt.c | 381 ++++++++++++++++++++-----
lib/power/rte_power_pmd_mgmt.h | 34 +++
lib/power/version.map | 3 +
6 files changed, 412 insertions(+), 94 deletions(-)
diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
* **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
* **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
.. _nic_features_other:
Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..fac2c19516 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,41 @@ Ethernet PMD Power Management API
Abstract
~~~~~~~~
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
- This power saving scheme will put the CPU into optimized power state
- and use the ``rte_power_monitor()`` function
- to monitor the Ethernet PMD RX descriptor address,
- and wake the CPU up whenever there's new traffic.
-
-Pause
- This power saving scheme will avoid busy polling
- by either entering power-optimized sleep state
- with ``rte_power_pause()`` function,
- or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
- This power saving scheme will use ``librte_power`` library
- functionality to scale the core frequency up/down
- depending on traffic volume.
-
-.. note::
-
- Currently, this power management API is limited to mandatory mapping
- of 1 queue to 1 core (multiple queues are supported,
- but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+ This power saving scheme will put the CPU into optimized power state and
+ monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+ there's new traffic. Support for this scheme may not be available on all
+ platforms, and further limitations may apply (see below).
+
+* Pause
+ This power saving scheme will avoid busy polling by either entering
+ power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+ not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+ This power saving scheme will use ``librte_power`` library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+ limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
+ monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+ ``rte_power_monitor()`` function is not supported, then monitor mode will not
+ be supported.
+
+* Not all Ethernet devices support monitoring, even if the underlying
+ platform may support the necessary CPU instructions. Please refer to
+ :doc:`../nics/overview` for more information.
+
API Overview for Ethernet PMD Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -234,6 +241,16 @@ API Overview for Ethernet PMD Power Management
* **Queue Disable**: Disable power scheme for certain queue/port/core.
+* **Set Power Save Queue**: In case of polling multiple queues from one lcore,
+ designate a specific queue to be the one that triggers power management routines.
+
+.. note::
+
+ When using PMD power management with multiple Ethernet Rx queues on one lcore,
+ it is required to designate one of the configured Rx queues as a "power save"
+ queue by calling the appropriate API. Failing to do so will result in no
+ power saving ever taking effect.
+
References
----------
@@ -242,3 +259,5 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index f015c509fc..3926d45ef8 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -57,6 +57,9 @@ New Features
* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+* rte_power: The experimental PMD power management API now supports managing
+ multiple Ethernet Rx queues per lcore.
+
Removed Items
-------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..7762cd39b8 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,7 +33,28 @@ enum pmd_mgmt_state {
PMD_MGMT_ENABLED
};
-struct pmd_queue_cfg {
+union queue {
+ uint32_t val;
+ struct {
+ uint16_t portid;
+ uint16_t qid;
+ };
+};
+
+struct queue_list_entry {
+ TAILQ_ENTRY(queue_list_entry) next;
+ union queue queue;
+};
+
+struct pmd_core_cfg {
+ TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+ /**< Which port-queue pairs are associated with this lcore? */
+ union queue power_save_queue;
+ /**< When polling multiple queues, all but this one will be ignored */
+ bool power_save_queue_set;
+ /**< When polling multiple queues, power save queue must be set */
+ size_t n_queues;
+ /**< How many queues are in the list? */
volatile enum pmd_mgmt_state pwr_mgmt_state;
/**< State of power management for this queue */
enum rte_power_pmd_mgmt_type cb_mode;
@@ -43,8 +64,96 @@ struct pmd_queue_cfg {
uint64_t empty_poll_stats;
/**< Number of empty polls */
} __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+ return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+ dst->val = src->val;
+}
+
+static inline bool
+queue_is_power_save(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ const union queue *pwrsave = &cfg->power_save_queue;
+
+ /* if there's only single queue, no need to check anything */
+ if (cfg->n_queues == 1)
+ return true;
+ return cfg->power_save_queue_set && queue_equal(q, pwrsave);
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *cur;
+
+ TAILQ_FOREACH(cur, &cfg->head, next) {
+ if (queue_equal(&cur->queue, q))
+ return cur;
+ }
+ return NULL;
+}
+
+static int
+queue_set_power_save(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ const struct queue_list_entry *found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return -ENOENT;
+ queue_copy(&cfg->power_save_queue, q);
+ cfg->power_save_queue_set = true;
+ return 0;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *qle;
+
+ /* is it already in the list? */
+ if (queue_list_find(cfg, q) != NULL)
+ return -EEXIST;
+
+ qle = malloc(sizeof(*qle));
+ if (qle == NULL)
+ return -ENOMEM;
+
+ queue_copy(&qle->queue, q);
+ TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+ cfg->n_queues++;
+
+ return 0;
+}
+
+static int
+queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *found;
+
+ found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return -ENOENT;
+
+ TAILQ_REMOVE(&cfg->head, found, next);
+ cfg->n_queues--;
+ free(found);
+
+ /* if this was a power save queue, unset it */
+ if (cfg->power_save_queue_set && queue_is_power_save(cfg, q)) {
+ union queue *pwrsave = &cfg->power_save_queue;
+ cfg->power_save_queue_set = false;
+ pwrsave->val = 0;
+ }
+
+ return 0;
+}
static void
calc_tsc(void)
@@ -79,10 +188,10 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused,
void *addr __rte_unused)
{
+ const unsigned int lcore = rte_lcore_id();
+ struct pmd_core_cfg *q_conf;
- struct pmd_queue_cfg *q_conf;
-
- q_conf = &port_cfg[port_id][qidx];
+ q_conf = &lcore_cfg[lcore];
if (unlikely(nb_rx == 0)) {
q_conf->empty_poll_stats++;
@@ -107,11 +216,26 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused,
void *addr __rte_unused)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const union queue q = {.portid = port_id, .qid = qidx};
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *q_conf;
- q_conf = &port_cfg[port_id][qidx];
+ q_conf = &lcore_cfg[lcore];
- if (unlikely(nb_rx == 0)) {
+ /* early exit */
+ if (likely(!empty)) {
+ q_conf->empty_poll_stats = 0;
+ } else {
+ /* do we care about this particular queue? */
+ if (!queue_is_power_save(q_conf, &q))
+ return nb_rx;
+
+ /*
+ * we can increment unconditionally here because if there were
+ * non-empty polls in other queues assigned to this core, we
+ * dropped the counter to zero anyway.
+ */
q_conf->empty_poll_stats++;
/* sleep for 1 microsecond */
if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
@@ -127,8 +251,7 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
rte_pause();
}
}
- } else
- q_conf->empty_poll_stats = 0;
+ }
return nb_rx;
}
@@ -138,19 +261,33 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
uint16_t max_pkts __rte_unused, void *_ __rte_unused)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const union queue q = {.portid = port_id, .qid = qidx};
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *q_conf;
- q_conf = &port_cfg[port_id][qidx];
+ q_conf = &lcore_cfg[lcore];
- if (unlikely(nb_rx == 0)) {
+ /* early exit */
+ if (likely(!empty)) {
+ q_conf->empty_poll_stats = 0;
+
+ /* scale up freq immediately */
+ rte_power_freq_max(rte_lcore_id());
+ } else {
+ /* do we care about this particular queue? */
+ if (!queue_is_power_save(q_conf, &q))
+ return nb_rx;
+
+ /*
+ * we can increment unconditionally here because if there were
+ * non-empty polls in other queues assigned to this core, we
+ * dropped the counter to zero anyway.
+ */
q_conf->empty_poll_stats++;
if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
/* scale down freq */
rte_power_freq_min(rte_lcore_id());
- } else {
- q_conf->empty_poll_stats = 0;
- /* scale up freq */
- rte_power_freq_max(rte_lcore_id());
}
return nb_rx;
@@ -167,11 +304,79 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
}
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+ const struct queue_list_entry *entry;
+
+ TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+ const union queue *q = &entry->queue;
+ int ret = queue_stopped(q->portid, q->qid);
+ if (ret != 1)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+ enum power_management_env env;
+
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ return -ENOTSUP;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore))
+ return -EINVAL;
+
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+ struct rte_power_monitor_cond dummy;
+
+ /* check if rte_power_monitor is supported */
+ if (!global_data.intrinsics_support.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ return -ENOTSUP;
+ }
+
+ if (cfg->n_queues > 0) {
+ RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+ return -ENOTSUP;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+ &dummy) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *queue_cfg;
struct rte_eth_dev_info info;
rte_rx_callback_fn clb;
int ret;
@@ -202,9 +407,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
- queue_cfg = &port_cfg[port_id][queue_id];
+ queue_cfg = &lcore_cfg[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(queue_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
+ /* if callback was already enabled, check current callback type */
+ if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+ queue_cfg->cb_mode != mode) {
ret = -EINVAL;
goto end;
}
@@ -214,53 +429,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
switch (mode) {
case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- struct rte_power_monitor_cond dummy;
-
- /* check if rte_power_monitor is supported */
- if (!global_data.intrinsics_support.power_monitor) {
- RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_monitor(queue_cfg, &qdata);
+ if (ret < 0)
goto end;
- }
- /* check if the device supports the necessary PMD API */
- if (rte_eth_get_monitor_addr(port_id, queue_id,
- &dummy) == -ENOTSUP) {
- RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_umwait;
break;
- }
case RTE_POWER_MGMT_TYPE_SCALE:
- {
- enum power_management_env env;
- /* only PSTATE and ACPI modes are supported */
- if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
- !rte_power_check_env_supported(
- PM_ENV_PSTATE_CPUFREQ)) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_scale(lcore_id);
+ if (ret < 0)
goto end;
- }
- /* ensure we could initialize the power library */
- if (rte_power_init(lcore_id)) {
- ret = -EINVAL;
- goto end;
- }
- /* ensure we initialized the correct env */
- env = rte_power_get_env();
- if (env != PM_ENV_ACPI_CPUFREQ &&
- env != PM_ENV_PSTATE_CPUFREQ) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_scale_freq;
break;
- }
case RTE_POWER_MGMT_TYPE_PAUSE:
/* figure out various time-to-tsc conversions */
if (global_data.tsc_per_us == 0)
@@ -273,11 +455,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -EINVAL;
goto end;
}
+ /* add this queue to the list */
+ ret = queue_list_add(queue_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+ strerror(-ret));
+ goto end;
+ }
/* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ if (queue_cfg->n_queues == 1) {
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ }
queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
clb, NULL);
@@ -290,7 +481,8 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *queue_cfg;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,13 +498,31 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
}
/* no need to check queue id as wrong queue id would not be enabled */
- queue_cfg = &port_cfg[port_id][queue_id];
+ queue_cfg = &lcore_cfg[lcore_id];
+
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(queue_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
return -EINVAL;
- /* stop any callbacks from progressing */
- queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ /*
+ * There is no good/easy way to do this without race conditions, so we
+ * are just going to throw our hands in the air and hope that the user
+ * has read the documentation and has ensured that ports are stopped at
+ * the time we enter the API functions.
+ */
+ ret = queue_list_remove(queue_cfg, &qdata);
+ if (ret < 0)
+ return -ret;
+
+ /* if we've removed all queues from the lists, set state to disabled */
+ if (queue_cfg->n_queues == 0)
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
switch (queue_cfg->cb_mode) {
case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
@@ -336,3 +546,42 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
return 0;
}
+
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+ uint16_t port_id, uint16_t queue_id)
+{
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *queue_cfg;
+ int ret;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+ if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+ return -EINVAL;
+
+ /* no need to check queue id as wrong queue id would not be enabled */
+ queue_cfg = &lcore_cfg[lcore_id];
+
+ if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+ return -EINVAL;
+
+ ret = queue_set_power_save(queue_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to set power save queue: %s\n",
+ strerror(-ret));
+ return -ret;
+ }
+
+ return 0;
+}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+ size_t i;
+
+ /* initialize all tailqs */
+ for (i = 0; i < RTE_DIM(lcore_cfg); i++) {
+ struct pmd_core_cfg *cfg = &lcore_cfg[i];
+ TAILQ_INIT(&cfg->head);
+ }
+}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 444e7b8a66..d6ef8f778a 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -90,6 +90,40 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Set a specific Ethernet device Rx queue to be the "power save" queue for a
+ * particular lcore. When multiple queues are assigned to a single lcore using
+ * the `rte_power_ethdev_pmgmt_queue_enable` API, only one of them will trigger
+ * the power management. In a typical scenario, the last queue to be polled on
+ * a particular lcore should be designated as the power save queue.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @note When using multiple queues per lcore, calling this function is
+ * mandatory. If not called, no power management routines will be triggered
+ * when the traffic starts.
+ *
+ * @warning This function must be called when all affected Ethernet ports are
+ * stopped and no Rx/Tx is in progress!
+ *
+ * @param lcore_id
+ * The lcore the Rx queue is polled from.
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The queue identifier of the Ethernet device.
+ * @return
+ * 0 on success
+ * <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+ uint16_t port_id, uint16_t queue_id);
+
#ifdef __cplusplus
}
#endif
diff --git a/lib/power/version.map b/lib/power/version.map
index b004e3e4a9..105d1d94c2 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -38,4 +38,7 @@ EXPERIMENTAL {
# added in 21.02
rte_power_ethdev_pmgmt_queue_disable;
rte_power_ethdev_pmgmt_queue_enable;
+
+ # added in 21.08
+ rte_power_ethdev_pmgmt_queue_set_power_save;
};
--
2.25.1
* [dpdk-dev] [PATCH v4 6/7] power: support monitoring multiple Rx queues
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
` (4 preceding siblings ...)
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-28 15:54 ` Anatoly Burakov
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy-efficient power state. The
multi-monitor version is used unconditionally when supported; the UMWAIT
version is used only when the hardware does not support multi-monitor.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v4:
- Fix possible out of bounds access
- Added missing index increment
doc/guides/prog_guide/power_man.rst | 9 ++--
lib/power/rte_power_pmd_mgmt.c | 84 ++++++++++++++++++++++++++++-
2 files changed, 88 insertions(+), 5 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index fac2c19516..3245a5ebed 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
The "monitor" mode is only supported in the following configurations and scenarios:
* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor_multi()`` function is supported by the platform, then
+ monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
``rte_power_monitor()`` is supported by the platform, then monitoring will be
limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
monitored from a different lcore).
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
- ``rte_power_monitor()`` function is not supported, then monitor mode will not
- be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+ two monitoring functions is supported, then monitor mode will not be supported.
* Not all Ethernet devices support monitoring, even if the underlying
platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 7762cd39b8..97c9f1ea36 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -155,6 +155,32 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
return 0;
}
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+ struct rte_power_monitor_cond *pmc, size_t len)
+{
+ const struct queue_list_entry *qle;
+ size_t i = 0;
+ int ret;
+
+ TAILQ_FOREACH(qle, &cfg->head, next) {
+ const union queue *q = &qle->queue;
+ struct rte_power_monitor_cond *cur;
+
+ /* attempted out of bounds access */
+ if (i >= len) {
+ RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+ return -1;
+ }
+
+ cur = &pmc[i++];
+ ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
static void
calc_tsc(void)
{
@@ -183,6 +209,48 @@ calc_tsc(void)
}
}
+static uint16_t
+clb_multiwait(uint16_t port_id, uint16_t qidx,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+ const unsigned int lcore = rte_lcore_id();
+ const union queue q = {.portid = port_id, .qid = qidx};
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *q_conf;
+
+ q_conf = &lcore_cfg[lcore];
+
+ /* early exit */
+ if (likely(!empty)) {
+ q_conf->empty_poll_stats = 0;
+ } else {
+ /* do we care about this particular queue? */
+ if (!queue_is_power_save(q_conf, &q))
+ return nb_rx;
+
+ /*
+ * we can increment unconditionally here because if there were
+ * non-empty polls in other queues assigned to this core, we
+ * dropped the counter to zero anyway.
+ */
+ q_conf->empty_poll_stats++;
+ if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+ struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
+ uint16_t ret;
+
+ /* gather all monitoring conditions */
+ ret = get_monitor_addresses(q_conf, pmc, RTE_DIM(pmc));
+
+ if (ret == 0)
+ rte_power_monitor_multi(pmc,
+ q_conf->n_queues, UINT64_MAX);
+ }
+ }
+
+ return nb_rx;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused,
@@ -348,14 +416,19 @@ static int
check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
{
struct rte_power_monitor_cond dummy;
+ bool multimonitor_supported;
/* check if rte_power_monitor is supported */
if (!global_data.intrinsics_support.power_monitor) {
RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
return -ENOTSUP;
}
+ /* check if multi-monitor is supported */
+ multimonitor_supported =
+ global_data.intrinsics_support.power_monitor_multi;
- if (cfg->n_queues > 0) {
+ /* if we're adding a new queue, do we support multiple queues? */
+ if (cfg->n_queues > 0 && !multimonitor_supported) {
RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
return -ENOTSUP;
}
@@ -371,6 +444,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
return 0;
}
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+ return global_data.intrinsics_support.power_monitor_multi ?
+ clb_multiwait : clb_umwait;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -434,7 +514,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (ret < 0)
goto end;
- clb = clb_umwait;
+ clb = get_monitor_callback();
break;
case RTE_POWER_MGMT_TYPE_SCALE:
/* check if we can add a new queue */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v4 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
` (5 preceding siblings ...)
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-28 15:54 ` Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation, and always
mark the last queue in qconf as the power save queue.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
examples/l3fwd-power/main.c | 39 +++++++++++++++++++++++--------------
1 file changed, 24 insertions(+), 15 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..3057c06936 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2498,6 +2498,27 @@ mode_to_str(enum appmode mode)
}
}
+static void
+pmd_pmgmt_set_up(unsigned int lcore, uint16_t portid, uint16_t qid, bool last)
+{
+ int ret;
+
+ ret = rte_power_ethdev_pmgmt_queue_enable(lcore, portid,
+ qid, pmgmt_type);
+ if (ret < 0)
+ rte_exit(EXIT_FAILURE,
+ "rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
+ ret, portid);
+
+ if (!last)
+ return;
+ ret = rte_power_ethdev_pmgmt_queue_set_power_save(lcore, portid, qid);
+ if (ret < 0)
+ rte_exit(EXIT_FAILURE,
+ "rte_power_ethdev_pmgmt_queue_set_power_save: err=%d, port=%d\n",
+ ret, portid);
+}
+
int
main(int argc, char **argv)
{
@@ -2723,12 +2744,6 @@ main(int argc, char **argv)
printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
fflush(stdout);
- /* PMD power management mode can only do 1 queue per core */
- if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
- rte_exit(EXIT_FAILURE,
- "In PMD power management mode, only one queue per lcore is allowed\n");
- }
-
/* init RX queues */
for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
struct rte_eth_rxconf rxq_conf;
@@ -2767,15 +2782,9 @@ main(int argc, char **argv)
"Fail to add ptype cb\n");
}
- if (app_mode == APP_MODE_PMD_MGMT) {
- ret = rte_power_ethdev_pmgmt_queue_enable(
- lcore_id, portid, queueid,
- pmgmt_type);
- if (ret < 0)
- rte_exit(EXIT_FAILURE,
- "rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
- ret, portid);
- }
+ if (app_mode == APP_MODE_PMD_MGMT)
+ pmd_pmgmt_set_up(lcore_id, portid, queueid,
+ queue == (qconf->n_rx_queue - 1));
}
}
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
` (6 preceding siblings ...)
2021-06-28 15:54 ` [dpdk-dev] [PATCH v4 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-06-29 15:48 ` Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
` (7 more replies)
7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
To: dev; +Cc: david.hunt, konstantin.ananyev, ciara.loftus
This patchset introduces several changes related to PMD power management:
- Changed monitoring intrinsics to use callbacks as a comparison function, based
on previous patchset [1] but incorporating feedback [2] - this hopefully will
make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
accompanying infrastructure and example apps changes
v5:
- Removed "power save queue" API and replaced with mechanism suggested by
Konstantin
- Addressed other feedback
v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections
v3:
- Moved some doc updates to NIC features list
v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary
[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274
Anatoly Burakov (7):
power_intrinsics: use callbacks for comparison
net/af_xdp: add power monitor support
eal: add power monitor for multiple events
power: remove thread safety from PMD power API's
power: support callbacks for multiple Rx queues
power: support monitoring multiple Rx queues
l3fwd-power: support multiqueue in PMD pmgmt modes
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 68 +-
doc/guides/rel_notes/release_21_08.rst | 11 +
drivers/event/dlb2/dlb2.c | 17 +-
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +
drivers/net/i40e/i40e_rxtx.c | 20 +-
drivers/net/iavf/iavf_rxtx.c | 20 +-
drivers/net/ice/ice_rxtx.c | 20 +-
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +-
drivers/net/mlx5/mlx5_rx.c | 17 +-
examples/l3fwd-power/main.c | 6 -
lib/eal/arm/rte_power_intrinsics.c | 11 +
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 68 +-
lib/eal/ppc/rte_power_intrinsics.c | 11 +
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 90 ++-
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 633 +++++++++++++-----
lib/power/rte_power_pmd_mgmt.h | 6 +
21 files changed, 810 insertions(+), 262 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-06-29 15:48 ` Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 2/7] net/af_xdp: add power monitor support Anatoly Burakov
` (6 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Previously, the semantics of power monitor were such that we were
checking current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.
This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define its
own comparison semantics and decision making on how to detect the need
to abort the entering of a power-optimized state.
Existing implementations are adjusted to follow the new semantics.
Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
Notes:
v4:
- Return error if callback is set to NULL
- Replace raw number with a macro in monitor condition opaque data
v2:
- Use callback mechanism for more flexibility
- Address feedback from Konstantin
doc/guides/rel_notes/release_21_08.rst | 1 +
drivers/event/dlb2/dlb2.c | 17 ++++++++--
drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
drivers/net/ice/ice_rxtx.c | 20 +++++++----
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
.../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
9 files changed, 121 insertions(+), 44 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
Also, make sure to start the actual text at the margin.
=======================================================
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
ABI Changes
-----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
}
}
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ /* abort if the value matches */
+ return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
static inline int
dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
expected_value = 0;
pmc.addr = monitor_addr;
- pmc.val = expected_value;
- pmc.mask = qe_mask.raw_qe[1];
+ /* store expected value and comparison mask in opaque data */
+ pmc.opaque[CLB_VAL_IDX] = expected_value;
+ pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+ /* set up callback */
+ pmc.fn = dlb2_monitor_callback;
pmc.size = sizeof(uint64_t);
rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..081682f88b 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
#define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
+static int
+i40e_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = i40e_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index 0361af0d85..7ed196ec22 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
}
+static int
+iavf_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = iavf_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a3e7..d12437d19d 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
+static int
+ice_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.status_error0;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
- pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /* comparison callback */
+ pmc->fn = ice_monitor_callback;
/* register is 16-bit */
pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
};
+static int
+ixgbe_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.upper.status_error;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
- pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /* comparison callback */
+ pmc->fn = ixgbe_monitor_callback;
/* the registers are 32-bit */
pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..17370b77dc 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return rx_queue_count(rxq);
}
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t m = opaque[CLB_MSK_IDX];
+ const uint64_t v = opaque[CLB_VAL_IDX];
+
+ return (value & m) == v ? -1 : 0;
+}
+
int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -rte_errno;
}
pmc->addr = &cqe->op_own;
- pmc->val = !!idx;
- pmc->mask = MLX5_CQE_OWNER_MASK;
+ pmc->opaque[CLB_VAL_IDX] = !!idx;
+ pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+ pmc->fn = mlx_monitor_callback;
pmc->size = sizeof(uint8_t);
return 0;
}
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
* which are architecture-dependent.
*/
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ * The value read from memory.
+ * @param opaque
+ * Callback-specific data.
+ *
+ * @return
+ * 0 if entering of power optimized state should proceed
+ * -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
struct rte_power_monitor_cond {
volatile void *addr; /**< Address to monitor for changes */
- uint64_t val; /**< If the `mask` is non-zero, location pointed
- * to by `addr` will be read and compared
- * against this value.
- */
- uint64_t mask; /**< 64-bit mask to extract value read from `addr` */
- uint8_t size; /**< Data size (in bytes) that will be used to compare
- * expected value (`val`) with data read from the
+ uint8_t size; /**< Data size (in bytes) that will be read from the
* monitored memory location (`addr`). Can be 1, 2,
* 4, or 8. Supplying any other value will result in
* an error.
*/
+ rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+ * entering power optimized state should
+ * be aborted.
+ */
+ uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+ /**< Callback-specific data */
};
/**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
const unsigned int lcore_id = rte_lcore_id();
struct power_wait_status *s;
+ uint64_t cur_value;
/* prevent user from running this instruction if it's not supported */
if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
if (__check_val_size(pmc->size) < 0)
return -EINVAL;
+ if (pmc->fn == NULL)
+ return -EINVAL;
+
s = &wait_status[lcore_id];
/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
/* now that we've put this address into monitor, we can unlock */
rte_spinlock_unlock(&s->lock);
- /* if we have a comparison mask, we might not need to sleep at all */
- if (pmc->mask) {
- const uint64_t cur_value = __get_umwait_val(
- pmc->addr, pmc->size);
- const uint64_t masked = cur_value & pmc->mask;
+ cur_value = __get_umwait_val(pmc->addr, pmc->size);
- /* if the masked value is already matching, abort */
- if (masked == pmc->val)
- goto end;
- }
+ /* check if callback indicates we should abort */
+ if (pmc->fn(cur_value, pmc->opaque) != 0)
+ goto end;
/* execute UMWAIT */
asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v5 2/7] net/af_xdp: add power monitor support
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-06-29 15:48 ` Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 3/7] eal: add power monitor for multiple events Anatoly Burakov
` (5 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev
Implement support for .get_monitor_addr in AF_XDP driver.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Rewrite using the callback mechanism
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..7830d0c23a 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
#include <rte_malloc.h>
#include <rte_ring.h>
#include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
#include "compat.h"
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t v = opaque[CLB_VAL_IDX];
+ const uint64_t m = (uint32_t)~0;
+
+ /* if the value has changed, abort entering power optimized state */
+ return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+ struct pkt_rx_queue *rxq = rx_queue;
+ unsigned int *prod = rxq->rx.producer;
+ const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+ /* watch for changes in producer ring */
+ pmc->addr = (void*)prod;
+
+ /* store current value */
+ pmc->opaque[CLB_VAL_IDX] = cur_val;
+ pmc->fn = eth_monitor_callback;
+
+ /* AF_XDP producer ring index is 32-bit */
+ pmc->size = sizeof(uint32_t);
+
+ return 0;
+}
+
static int
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
{
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.stats_get = eth_stats_get,
.stats_reset = eth_stats_reset,
+ .get_monitor_addr = eth_get_monitor_addr
};
/** parse busy_budget argument */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v5 3/7] eal: add power monitor for multiple events
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-06-29 15:48 ` Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
` (4 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
To: dev, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v4:
- Fixed bugs in accessing the monitor condition
- Abort on any monitor condition not having a defined callback
v2:
- Adapt to callback mechanism
doc/guides/rel_notes/release_21_08.rst | 2 +
lib/eal/arm/rte_power_intrinsics.c | 11 +++
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 35 +++++++++
lib/eal/ppc/rte_power_intrinsics.c | 11 +++
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++
8 files changed, 139 insertions(+)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+
Removed Items
-------------
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
/**< indicates support for rte_power_monitor function */
uint32_t power_pause : 1;
/**< indicates support for rte_power_pause function */
+ uint32_t power_monitor_multi : 1;
+ /**< indicates support for rte_power_monitor_multi function */
};
/**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
__rte_experimental
int rte_power_pause(const uint64_t tsc_timestamp);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each condition carries a comparison callback and opaque data.
+ * Before sleeping, the current value at each monitored address is read and
+ * passed to that condition's callback; if any callback reports that its wakeup
+ * condition is already met, the entering of optimized power state is aborted.
+ *
+ * @warning It is the responsibility of the user to check if this function is
+ * supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ * Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ * An array of monitoring condition structures.
+ * @param num
+ * Length of the `pmc` array.
+ * @param tsc_timestamp
+ * Maximum TSC timestamp to wait for. Note that the wait behavior is
+ * architecture-dependent.
+ *
+ * @return
+ * 0 on success
+ * -EINVAL on invalid parameters
+ * -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp);
+
#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
rte_version_release; # WINDOWS_NO_EXPORT
rte_version_suffix; # WINDOWS_NO_EXPORT
rte_version_year; # WINDOWS_NO_EXPORT
+
+ # added in 21.08
+ rte_power_monitor_multi; # WINDOWS_NO_EXPORT
};
INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
intrinsics->power_monitor = 1;
intrinsics->power_pause = 1;
+ if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+ intrinsics->power_monitor_multi = 1;
}
}
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
#include <rte_common.h>
#include <rte_lcore.h>
+#include <rte_rtm.h>
#include <rte_spinlock.h>
#include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
}
static bool wait_supported;
+static bool wait_multi_supported;
static inline uint64_t
__get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
if (i.power_monitor && i.power_pause)
wait_supported = 1;
+ if (i.power_monitor_multi)
+ wait_multi_supported = 1;
}
int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
* In this case, since we've already woken up, the "wakeup" was
* unneeded, and since T1 is still waiting on T2 releasing the lock, the
* wakeup address is still valid so it's perfectly safe to write it.
+ *
+ * For multi-monitor case, the act of locking will in itself trigger the
+ * wakeup, so no additional writes necessary.
*/
rte_spinlock_lock(&s->lock);
if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return 0;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ const unsigned int lcore_id = rte_lcore_id();
+ struct power_wait_status *s = &wait_status[lcore_id];
+ uint32_t i, rc;
+
+ /* check if supported */
+ if (!wait_multi_supported)
+ return -ENOTSUP;
+
+ if (pmc == NULL || num == 0)
+ return -EINVAL;
+
+ /* we are already inside transaction region, return */
+ if (rte_xtest() != 0)
+ return 0;
+
+ /* start new transaction region */
+ rc = rte_xbegin();
+
+ /* transaction abort, possible write to one of wait addresses */
+ if (rc != RTE_XBEGIN_STARTED)
+ return 0;
+
+ /*
+ * the mere act of reading the lock status here adds the lock to
+ * the read set. This means that when we trigger a wakeup from another
+ * thread, even if we don't have a defined wakeup address and thus don't
+ * actually cause any writes, the act of locking our lock will itself
+ * trigger the wakeup and abort the transaction.
+ */
+ rte_spinlock_is_locked(&s->lock);
+
+ /*
+ * add all addresses to wait on into transaction read-set and check if
+ * any of wakeup conditions are already met.
+ */
+ rc = 0;
+ for (i = 0; i < num; i++) {
+ const struct rte_power_monitor_cond *c = &pmc[i];
+
+ /* cannot be NULL */
+ if (c->fn == NULL) {
+ rc = -EINVAL;
+ break;
+ }
+
+ const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+ /* abort if callback indicates that we need to stop */
+ if (c->fn(val, c->opaque) != 0)
+ break;
+ }
+
+ /* none of the conditions were met, sleep until timeout */
+ if (i == num)
+ rte_power_pause(tsc_timestamp);
+
+ /* end transaction region */
+ rte_xend();
+
+ return rc;
+}
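The heart of the new API is the loop that samples each monitored address and asks a per-condition callback whether the sleep should be skipped. The following is a standalone sketch of just that loop (no DPDK, no TSX); the `monitor_cond` struct and the `val_equals` callback are hypothetical stand-ins that only mirror the fields the loop actually needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical mirror of the patch's condition struct, trimmed to essentials */
struct monitor_cond {
	const volatile uint64_t *addr;            /* address to sample */
	int (*fn)(uint64_t val, uint64_t opaque); /* nonzero -> abort sleep */
	uint64_t opaque;                          /* callback-private data */
};

/* example callback: condition fires when the sampled value equals 'opaque' */
static int val_equals(uint64_t val, uint64_t opaque)
{
	return val == opaque;
}

/* Returns the index of the first condition that fired, or num if none did;
 * in the real function, i == num is what allows rte_power_pause() to run. */
static uint32_t check_conds(const struct monitor_cond pmc[], uint32_t num)
{
	uint32_t i;

	for (i = 0; i < num; i++) {
		const uint64_t val = *pmc[i].addr;

		if (pmc[i].fn != NULL && pmc[i].fn(val, pmc[i].opaque) != 0)
			break; /* wakeup condition already met -> skip sleep */
	}
	return i;
}
```

Note that, as in the patch, a `NULL` callback is treated as an error condition rather than "always sleep"; the sketch simply skips it for brevity.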
--
2.25.1
* [dpdk-dev] [PATCH v5 4/7] power: remove thread safety from PMD power API's
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
` (2 preceding siblings ...)
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-06-29 15:48 ` Anatoly Burakov
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
` (3 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, we expect that only one callback can be active at any given
moment for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.
We could have used something like RCU for this use case, but absent
a pressing need for thread safety we'll take the easy way and simply
mandate that the API's are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.
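The new precondition is enforced by querying the queue state before any change: the patch's `queue_stopped()` helper returns -1 on a failed query, 0 for a running queue, and 1 for a stopped one, and the API maps those to error codes. A minimal standalone model of that mapping (plain C, no DPDK; `precheck_queue_state` is a hypothetical name):

```c
#include <assert.h>
#include <errno.h>

/*
 * Models the stopped-queue precondition added by this patch:
 *   queue_stopped() ->  -1 : query failed (invalid queue)
 *                        0 : queue still running
 *                        1 : queue stopped
 */
static int precheck_queue_state(int queue_stopped_ret)
{
	if (queue_stopped_ret != 1) {
		/* error means invalid queue, 0 means queue wasn't stopped */
		return queue_stopped_ret < 0 ? -EINVAL : -EBUSY;
	}
	return 0; /* queue is stopped; safe to reconfigure */
}
```

In practice this means an application must stop the Rx queue (e.g. via `rte_eth_dev_rx_queue_stop()`) before enabling or disabling PMD power management on it, and restart it afterwards.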
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Add check for stopped queue
- Clarified doc message
- Added release notes
doc/guides/rel_notes/release_21_08.rst | 5 +
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 133 ++++++++++---------------
lib/power/rte_power_pmd_mgmt.h | 6 ++
4 files changed, 67 insertions(+), 80 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+* rte_power: The experimental PMD power management API is no longer considered
+ to be thread safe; all Rx queues affected by the API will now need to be
+ stopped before making any changes to the power management scheme.
+
+
ABI Changes
-----------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
'rte_power_pmd_mgmt.h',
'rte_power_guest_channel.h',
)
+if cc.has_argument('-Wno-cast-qual')
+ cflags += '-Wno-cast-qual'
+endif
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
/**< Callback mode for this queue */
const struct rte_eth_rxtx_callback *cur_cb;
/**< Callback instance */
- volatile bool umwait_in_progress;
- /**< are we currently sleeping? */
uint64_t empty_poll_stats;
/**< Number of empty polls */
} __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
struct rte_power_monitor_cond pmc;
uint16_t ret;
- /*
- * we might get a cancellation request while being
- * inside the callback, in which case the wakeup
- * wouldn't work because it would've arrived too early.
- *
- * to get around this, we notify the other thread that
- * we're sleeping, so that it can spin until we're done.
- * unsolicited wakeups are perfectly safe.
- */
- q_conf->umwait_in_progress = true;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- /* check if we need to cancel sleep */
- if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
- /* use monitoring condition to sleep */
- ret = rte_eth_get_monitor_addr(port_id, qidx,
- &pmc);
- if (ret == 0)
- rte_power_monitor(&pmc, UINT64_MAX);
- }
- q_conf->umwait_in_progress = false;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+ /* use monitoring condition to sleep */
+ ret = rte_eth_get_monitor_addr(port_id, qidx,
+ &pmc);
+ if (ret == 0)
+ rte_power_monitor(&pmc, UINT64_MAX);
}
} else
q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
return nb_rx;
}
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+ struct rte_eth_rxq_info qinfo;
+
+ if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+ return -1;
+
+ return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
struct pmd_queue_cfg *queue_cfg;
struct rte_eth_dev_info info;
+ rte_rx_callback_fn clb;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
queue_cfg = &port_cfg[port_id][queue_id];
if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->umwait_in_progress = false;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* ensure we update our state before callback starts */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_umwait, NULL);
+ clb = clb_umwait;
break;
}
case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
- queue_id, clb_scale_freq, NULL);
+ clb = clb_scale_freq;
break;
}
case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (global_data.tsc_per_us == 0)
calc_tsc();
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_pause, NULL);
+ clb = clb_pause;
break;
+ default:
+ RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+ ret = -EINVAL;
+ goto end;
}
+
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, NULL);
+
ret = 0;
end:
return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
struct pmd_queue_cfg *queue_cfg;
+ int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
return -EINVAL;
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
/* no need to check queue id as wrong queue id would not be enabled */
queue_cfg = &port_cfg[port_id][queue_id];
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
/* stop any callbacks from progressing */
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
- /* ensure we update our state before continuing */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
switch (queue_cfg->cb_mode) {
- case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- bool exit = false;
- do {
- /*
- * we may request cancellation while the other thread
- * has just entered the callback but hasn't started
- * sleeping yet, so keep waking it up until we know it's
- * done sleeping.
- */
- if (queue_cfg->umwait_in_progress)
- rte_power_monitor_wakeup(lcore_id);
- else
- exit = true;
- } while (!exit);
- }
- /* fall-through */
+ case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
rte_eth_remove_rx_callback(port_id, queue_id,
queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
break;
}
/*
- * we don't free the RX callback here because it is unsafe to do so
- * unless we know for a fact that all data plane threads have stopped.
+ * the API doc mandates that the user stops all processing on affected
+ * ports before calling any of these API's, so we can assume that the
+ * callbacks can be freed. we're intentionally casting away const-ness.
*/
- queue_cfg->cur_cb = NULL;
+ rte_free((void *)queue_cfg->cur_cb);
return 0;
}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue will be polled from.
* @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue is polled from.
* @param port_id
--
2.25.1
* [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
` (3 preceding siblings ...)
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-06-29 15:48 ` Anatoly Burakov
2021-06-30 9:52 ` David Hunt
2021-06-30 11:04 ` Ananyev, Konstantin
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 6/7] power: support monitoring " Anatoly Burakov
` (2 subsequent siblings)
7 siblings, 2 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, there is a hard limitation in the PMD power management
support that only allows a single queue per lcore. This is not ideal,
as most DPDK use cases will poll multiple queues per core.
The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:
- Replace per-queue structures with per-lcore ones, so that any device
polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
added to the list of queues to poll, so that the callback is aware of
other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
shared between all queues polled on a particular lcore, and are only
activated when all queues in the list have been polled and determined
to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
is incapable of monitoring more than one address.
Also, while we're at it, update and improve the docs.
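The "sleep only when every queue on the lcore is idle" logic above can be sketched with two small helpers mirroring the patch's `queue_can_sleep()`/`lcore_can_sleep()` pair; this is a standalone model (no DPDK), and the threshold value is illustrative only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* illustrative; the real value lives in the library */

struct queue_state { uint64_t n_empty_polls; };
struct lcore_state {
	size_t n_queues;            /* how many queues this lcore polls */
	uint64_t n_ready_to_sleep;  /* queues past the empty-poll threshold */
};

/* called for one queue on an empty poll */
static bool queue_can_sleep(struct lcore_state *lc, struct queue_state *q)
{
	/* below threshold: this queue alone vetoes the sleep */
	if (++q->n_empty_polls <= EMPTYPOLL_MAX)
		return false;
	lc->n_ready_to_sleep++;
	return true;
}

/* the lcore may sleep only when every queue it polls reported ready */
static bool lcore_can_sleep(struct lcore_state *lc)
{
	if (lc->n_ready_to_sleep != lc->n_queues)
		return false;
	lc->n_ready_to_sleep = 0; /* reset for the next poll iteration */
	return true;
}
```

A non-empty poll on any queue would reset both that queue's empty-poll counter and the lcore's ready counter, as the patch's `queue_reset()` does.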
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v5:
- Remove the "power save queue" API and replace it with mechanism suggested by
Konstantin
v3:
- Move the list of supported NICs to NIC feature table
v2:
- Use a TAILQ for queues instead of a static array
- Address feedback from Konstantin
- Add additional checks for stopped queues
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 65 ++--
doc/guides/rel_notes/release_21_08.rst | 3 +
lib/power/rte_power_pmd_mgmt.c | 431 ++++++++++++++++++-------
4 files changed, 373 insertions(+), 136 deletions(-)
diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
* **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
* **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
.. _nic_features_other:
Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..ec04a72108 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,41 @@ Ethernet PMD Power Management API
Abstract
~~~~~~~~
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
- This power saving scheme will put the CPU into optimized power state
- and use the ``rte_power_monitor()`` function
- to monitor the Ethernet PMD RX descriptor address,
- and wake the CPU up whenever there's new traffic.
-
-Pause
- This power saving scheme will avoid busy polling
- by either entering power-optimized sleep state
- with ``rte_power_pause()`` function,
- or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
- This power saving scheme will use ``librte_power`` library
- functionality to scale the core frequency up/down
- depending on traffic volume.
-
-.. note::
-
- Currently, this power management API is limited to mandatory mapping
- of 1 queue to 1 core (multiple queues are supported,
- but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+ This power saving scheme will put the CPU into optimized power state and
+ monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+ there's new traffic. Support for this scheme may not be available on all
+ platforms, and further limitations may apply (see below).
+
+* Pause
+ This power saving scheme will avoid busy polling by either entering
+ power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+ not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+ This power saving scheme will use ``librte_power`` library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+ limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
+ monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+ ``rte_power_monitor()`` function is not supported, then monitor mode will not
+ be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+ platform may support the necessary CPU instructions. Please refer to
+ :doc:`../nics/overview` for more information.
+
API Overview for Ethernet PMD Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +249,5 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index f015c509fc..3926d45ef8 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -57,6 +57,9 @@ New Features
* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+* rte_power: The experimental PMD power management API now supports managing
+ multiple Ethernet Rx queues per lcore.
+
Removed Items
-------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..fccfd236c2 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,96 @@ enum pmd_mgmt_state {
PMD_MGMT_ENABLED
};
-struct pmd_queue_cfg {
+union queue {
+ uint32_t val;
+ struct {
+ uint16_t portid;
+ uint16_t qid;
+ };
+};
+
+struct queue_list_entry {
+ TAILQ_ENTRY(queue_list_entry) next;
+ union queue queue;
+ uint64_t n_empty_polls;
+ const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+ TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+ /**< List of queues associated with this lcore */
+ size_t n_queues;
+ /**< How many queues are in the list? */
volatile enum pmd_mgmt_state pwr_mgmt_state;
/**< State of power management for this queue */
enum rte_power_pmd_mgmt_type cb_mode;
/**< Callback mode for this queue */
- const struct rte_eth_rxtx_callback *cur_cb;
- /**< Callback instance */
- uint64_t empty_poll_stats;
- /**< Number of empty polls */
+ uint64_t n_queues_ready_to_sleep;
+ /**< Number of queues ready to enter power optimized state */
} __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+ return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+ dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *cur;
+
+ TAILQ_FOREACH(cur, &cfg->head, next) {
+ if (queue_equal(&cur->queue, q))
+ return cur;
+ }
+ return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *qle;
+
+ /* is it already in the list? */
+ if (queue_list_find(cfg, q) != NULL)
+ return -EEXIST;
+
+ qle = malloc(sizeof(*qle));
+ if (qle == NULL)
+ return -ENOMEM;
+ memset(qle, 0, sizeof(*qle));
+
+ queue_copy(&qle->queue, q);
+ TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+ cfg->n_queues++;
+ qle->n_empty_polls = 0;
+
+ return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *found;
+
+ found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return NULL;
+
+ TAILQ_REMOVE(&cfg->head, found, next);
+ cfg->n_queues--;
+
+ /* freeing is responsibility of the caller */
+ return found;
+}
static void
calc_tsc(void)
@@ -74,21 +152,56 @@ calc_tsc(void)
}
}
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ /* reset empty poll counter for this queue */
+ qcfg->n_empty_polls = 0;
+ /* reset the sleep counter too */
+ cfg->n_queues_ready_to_sleep = 0;
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ /* this function is called - that means we have an empty poll */
+ qcfg->n_empty_polls++;
+
+ /* if we haven't reached threshold for empty polls, we can't sleep */
+ if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+ return false;
+
+ /* we're ready to sleep */
+ cfg->n_queues_ready_to_sleep++;
+
+ return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+ /* are all queues ready to sleep? */
+ if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+ return false;
+
+ /* we've reached an iteration where we can sleep, reset sleep counter */
+ cfg->n_queues_ready_to_sleep = 0;
+
+ return true;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+ uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
{
+ struct queue_list_entry *queue_conf = arg;
- struct pmd_queue_cfg *q_conf;
-
- q_conf = &port_cfg[port_id][qidx];
-
+ /* this callback can't do more than one queue, omit multiqueue logic */
if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+ queue_conf->n_empty_polls++;
+ if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
struct rte_power_monitor_cond pmc;
- uint16_t ret;
+ int ret;
/* use monitoring condition to sleep */
ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
rte_power_monitor(&pmc, UINT64_MAX);
}
} else
- q_conf->empty_poll_stats = 0;
+ queue_conf->n_empty_polls = 0;
return nb_rx;
}
static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
- q_conf = &port_cfg[port_id][qidx];
+ lcore_conf = &lcore_cfgs[lcore];
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- /* sleep for 1 microsecond */
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
- /* use tpause if we have it */
- if (global_data.intrinsics_support.power_pause) {
- const uint64_t cur = rte_rdtsc();
- const uint64_t wait_tsc =
- cur + global_data.tsc_per_us;
- rte_power_pause(wait_tsc);
- } else {
- uint64_t i;
- for (i = 0; i < global_data.pause_per_us; i++)
- rte_pause();
- }
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* sleep for 1 microsecond, use tpause if we have it */
+ if (global_data.intrinsics_support.power_pause) {
+ const uint64_t cur = rte_rdtsc();
+ const uint64_t wait_tsc =
+ cur + global_data.tsc_per_us;
+ rte_power_pause(wait_tsc);
+ } else {
+ uint64_t i;
+ for (i = 0; i < global_data.pause_per_us; i++)
+ rte_pause();
}
- } else
- q_conf->empty_poll_stats = 0;
+ }
return nb_rx;
}
static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
- uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+ struct queue_list_entry *queue_conf = arg;
- q_conf = &port_cfg[port_id][qidx];
+ if (likely(!empty)) {
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
- /* scale down freq */
- rte_power_freq_min(rte_lcore_id());
- } else {
- q_conf->empty_poll_stats = 0;
- /* scale up freq */
+ /* scale up freq immediately */
rte_power_freq_max(rte_lcore_id());
+ } else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ rte_power_freq_min(rte_lcore_id());
}
return nb_rx;
@@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
}
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+ const struct queue_list_entry *entry;
+
+ TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+ const union queue *q = &entry->queue;
+ int ret = queue_stopped(q->portid, q->qid);
+ if (ret != 1)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+ enum power_management_env env;
+
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ return -ENOTSUP;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore))
+ return -EINVAL;
+
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+ struct rte_power_monitor_cond dummy;
+
+ /* check if rte_power_monitor is supported */
+ if (!global_data.intrinsics_support.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ return -ENOTSUP;
+ }
+
+ if (cfg->n_queues > 0) {
+ RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+ return -ENOTSUP;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+ &dummy) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
struct rte_eth_dev_info info;
rte_rx_callback_fn clb;
int ret;
@@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
+ /* if callback was already enabled, check current callback type */
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+ lcore_cfg->cb_mode != mode) {
ret = -EINVAL;
goto end;
}
@@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
switch (mode) {
case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- struct rte_power_monitor_cond dummy;
-
- /* check if rte_power_monitor is supported */
- if (!global_data.intrinsics_support.power_monitor) {
- RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_monitor(lcore_cfg, &qdata);
+ if (ret < 0)
goto end;
- }
- /* check if the device supports the necessary PMD API */
- if (rte_eth_get_monitor_addr(port_id, queue_id,
- &dummy) == -ENOTSUP) {
- RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_umwait;
break;
- }
case RTE_POWER_MGMT_TYPE_SCALE:
- {
- enum power_management_env env;
- /* only PSTATE and ACPI modes are supported */
- if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
- !rte_power_check_env_supported(
- PM_ENV_PSTATE_CPUFREQ)) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_scale(lcore_id);
+ if (ret < 0)
goto end;
- }
- /* ensure we could initialize the power library */
- if (rte_power_init(lcore_id)) {
- ret = -EINVAL;
- goto end;
- }
- /* ensure we initialized the correct env */
- env = rte_power_get_env();
- if (env != PM_ENV_ACPI_CPUFREQ &&
- env != PM_ENV_PSTATE_CPUFREQ) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_scale_freq;
break;
- }
case RTE_POWER_MGMT_TYPE_PAUSE:
/* figure out various time-to-tsc conversions */
if (global_data.tsc_per_us == 0)
@@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -EINVAL;
goto end;
}
+ /* add this queue to the list */
+ ret = queue_list_add(lcore_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+ strerror(-ret));
+ goto end;
+ }
+ /* new queue is always added last */
+ queue_cfg = TAILQ_LAST(&lcore_cfg->head, queue_list_head);
/* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb, NULL);
+ if (lcore_cfg->n_queues == 1) {
+ lcore_cfg->cb_mode = mode;
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ }
+ queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, queue_cfg);
ret = 0;
end:
@@ -290,7 +476,9 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
}
/* no need to check queue id as wrong queue id would not be enabled */
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
return -EINVAL;
- /* stop any callbacks from progressing */
- queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ /*
+ * There is no good/easy way to do this without race conditions, so we
+ * are just going to throw our hands in the air and hope that the user
+ * has read the documentation and has ensured that ports are stopped at
+ * the time we enter the API functions.
+ */
+ queue_cfg = queue_list_take(lcore_cfg, &qdata);
+ if (queue_cfg == NULL)
+ return -ENOENT;
- switch (queue_cfg->cb_mode) {
+ /* if we've removed all queues from the lists, set state to disabled */
+ if (lcore_cfg->n_queues == 0)
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+ switch (lcore_cfg->cb_mode) {
case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
break;
case RTE_POWER_MGMT_TYPE_SCALE:
rte_power_freq_max(lcore_id);
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
rte_power_exit(lcore_id);
break;
}
@@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
* ports before calling any of these API's, so we can assume that the
* callbacks can be freed. we're intentionally casting away const-ness.
*/
- rte_free((void *)queue_cfg->cur_cb);
+ rte_free((void *)queue_cfg->cb);
+ free(queue_cfg);
return 0;
}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+ size_t i;
+
+ /* initialize all tailqs */
+ for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+ struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+ TAILQ_INIT(&cfg->head);
+ }
+}
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-30 9:52 ` David Hunt
2021-07-01 9:01 ` David Hunt
2021-06-30 11:04 ` Ananyev, Konstantin
1 sibling, 1 reply; 165+ messages in thread
From: David Hunt @ 2021-06-30 9:52 UTC (permalink / raw)
To: Anatoly Burakov, dev; +Cc: konstantin.ananyev, ciara.loftus
Hi Anatoly,
On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
>
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
>
> - Replace per-queue structures with per-lcore ones, so that any device
> polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
> added to the list of queues to poll, so that the callback is aware of
> other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism is
> shared between all queues polled on a particular lcore, and is only
> activated when all queues in the list were polled and were determined
> to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
> is incapable of monitoring more than one address.
>
> Also, while we're at it, update and improve the docs.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v5:
> - Remove the "power save queue" API and replace it with mechanism suggested by
> Konstantin
>
> v3:
> - Move the list of supported NICs to NIC feature table
>
> v2:
> - Use a TAILQ for queues instead of a static array
> - Address feedback from Konstantin
> - Add additional checks for stopped queues
>
> doc/guides/nics/features.rst | 10 +
> doc/guides/prog_guide/power_man.rst | 65 ++--
> doc/guides/rel_notes/release_21_08.rst | 3 +
> lib/power/rte_power_pmd_mgmt.c | 431 ++++++++++++++++++-------
> 4 files changed, 373 insertions(+), 136 deletions(-)
>
--snip--
> int
> rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
> {
> - struct pmd_queue_cfg *queue_cfg;
> + const union queue qdata = {.portid = port_id, .qid = queue_id};
> + struct pmd_core_cfg *lcore_cfg;
> + struct queue_list_entry *queue_cfg;
> struct rte_eth_dev_info info;
> rte_rx_callback_fn clb;
> int ret;
> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> goto end;
> }
>
> - queue_cfg = &port_cfg[port_id][queue_id];
> + lcore_cfg = &lcore_cfgs[lcore_id];
>
> - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
> + /* check if other queues are stopped as well */
> + ret = cfg_queues_stopped(lcore_cfg);
> + if (ret != 1) {
> + /* error means invalid queue, 0 means queue wasn't stopped */
> + ret = ret < 0 ? -EINVAL : -EBUSY;
> + goto end;
> + }
> +
> + /* if callback was already enabled, check current callback type */
> + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
> + lcore_cfg->cb_mode != mode) {
> ret = -EINVAL;
> goto end;
> }
> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>
> switch (mode) {
> case RTE_POWER_MGMT_TYPE_MONITOR:
> - {
> - struct rte_power_monitor_cond dummy;
> -
> - /* check if rte_power_monitor is supported */
> - if (!global_data.intrinsics_support.power_monitor) {
> - RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> - ret = -ENOTSUP;
> + /* check if we can add a new queue */
> + ret = check_monitor(lcore_cfg, &qdata);
> + if (ret < 0)
> goto end;
> - }
>
> - /* check if the device supports the necessary PMD API */
> - if (rte_eth_get_monitor_addr(port_id, queue_id,
> - &dummy) == -ENOTSUP) {
> - RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> - ret = -ENOTSUP;
> - goto end;
> - }
> clb = clb_umwait;
> break;
> - }
> case RTE_POWER_MGMT_TYPE_SCALE:
> - {
> - enum power_management_env env;
> - /* only PSTATE and ACPI modes are supported */
> - if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> - !rte_power_check_env_supported(
> - PM_ENV_PSTATE_CPUFREQ)) {
> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> - ret = -ENOTSUP;
> + /* check if we can add a new queue */
> + ret = check_scale(lcore_id);
> + if (ret < 0)
> goto end;
> - }
> - /* ensure we could initialize the power library */
> - if (rte_power_init(lcore_id)) {
> - ret = -EINVAL;
> - goto end;
> - }
> - /* ensure we initialized the correct env */
> - env = rte_power_get_env();
> - if (env != PM_ENV_ACPI_CPUFREQ &&
> - env != PM_ENV_PSTATE_CPUFREQ) {
> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> - ret = -ENOTSUP;
> - goto end;
> - }
> clb = clb_scale_freq;
> break;
> - }
> case RTE_POWER_MGMT_TYPE_PAUSE:
> /* figure out various time-to-tsc conversions */
> if (global_data.tsc_per_us == 0)
> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> ret = -EINVAL;
> goto end;
> }
> + /* add this queue to the list */
> + ret = queue_list_add(lcore_cfg, &qdata);
> + if (ret < 0) {
> + RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
> + strerror(-ret));
> + goto end;
> + }
> + /* new queue is always added last */
> + queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
Need to ensure that queue_cfg gets set here, otherwise we'll get a
segfault below.
>
> /* initialize data before enabling the callback */
> - queue_cfg->empty_poll_stats = 0;
> - queue_cfg->cb_mode = mode;
> - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> - clb, NULL);
> + if (lcore_cfg->n_queues == 1) {
> + lcore_cfg->cb_mode = mode;
> + lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> + }
> + queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
> + clb, queue_cfg);
--snip--
* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
2021-06-30 9:52 ` David Hunt
@ 2021-07-01 9:01 ` David Hunt
2021-07-05 10:24 ` Burakov, Anatoly
0 siblings, 1 reply; 165+ messages in thread
From: David Hunt @ 2021-07-01 9:01 UTC (permalink / raw)
To: Anatoly Burakov, dev; +Cc: konstantin.ananyev, ciara.loftus
On 30/6/2021 10:52 AM, David Hunt wrote:
> Hi Anatoly,
>
> On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
>> Currently, there is a hard limitation on the PMD power management
>> support that only allows it to support a single queue per lcore. This is
>> not ideal as most DPDK use cases will poll multiple queues per core.
>>
>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>> is very difficult to implement such support because callbacks are
>> effectively stateless and have no visibility into what the other ethdev
>> devices are doing. This places limitations on what we can do within the
>> framework of Rx callbacks, but the basics of this implementation are as
>> follows:
>>
>> - Replace per-queue structures with per-lcore ones, so that any device
>> polled from the same lcore can share data
>> - Any queue that is going to be polled from a specific lcore has to be
>> added to the list of queues to poll, so that the callback is aware of
>> other queues being polled by the same lcore
>> - Both the empty poll counter and the actual power saving mechanism is
>> shared between all queues polled on a particular lcore, and is only
>> activated when all queues in the list were polled and were determined
>> to have no traffic.
>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>> is incapable of monitoring more than one address.
>>
>> Also, while we're at it, update and improve the docs.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>> v5:
>> - Remove the "power save queue" API and replace it with
>> mechanism suggested by
>> Konstantin
>> v3:
>> - Move the list of supported NICs to NIC feature table
>> v2:
>> - Use a TAILQ for queues instead of a static array
>> - Address feedback from Konstantin
>> - Add additional checks for stopped queues
>>
>> doc/guides/nics/features.rst | 10 +
>> doc/guides/prog_guide/power_man.rst | 65 ++--
>> doc/guides/rel_notes/release_21_08.rst | 3 +
>> lib/power/rte_power_pmd_mgmt.c | 431 ++++++++++++++++++-------
>> 4 files changed, 373 insertions(+), 136 deletions(-)
>>
>
> --snip--
>
>> int
>> rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t
>> port_id,
>> uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>> {
>> - struct pmd_queue_cfg *queue_cfg;
>> + const union queue qdata = {.portid = port_id, .qid = queue_id};
>> + struct pmd_core_cfg *lcore_cfg;
>> + struct queue_list_entry *queue_cfg;
>> struct rte_eth_dev_info info;
>> rte_rx_callback_fn clb;
>> int ret;
>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int
>> lcore_id, uint16_t port_id,
>> goto end;
>> }
>> - queue_cfg = &port_cfg[port_id][queue_id];
>> + lcore_cfg = &lcore_cfgs[lcore_id];
>> - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>> + /* check if other queues are stopped as well */
>> + ret = cfg_queues_stopped(lcore_cfg);
>> + if (ret != 1) {
>> + /* error means invalid queue, 0 means queue wasn't stopped */
>> + ret = ret < 0 ? -EINVAL : -EBUSY;
>> + goto end;
>> + }
>> +
>> + /* if callback was already enabled, check current callback type */
>> + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>> + lcore_cfg->cb_mode != mode) {
>> ret = -EINVAL;
>> goto end;
>> }
>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned
>> int lcore_id, uint16_t port_id,
>> switch (mode) {
>> case RTE_POWER_MGMT_TYPE_MONITOR:
>> - {
>> - struct rte_power_monitor_cond dummy;
>> -
>> - /* check if rte_power_monitor is supported */
>> - if (!global_data.intrinsics_support.power_monitor) {
>> - RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not
>> supported\n");
>> - ret = -ENOTSUP;
>> + /* check if we can add a new queue */
>> + ret = check_monitor(lcore_cfg, &qdata);
>> + if (ret < 0)
>> goto end;
>> - }
>> - /* check if the device supports the necessary PMD API */
>> - if (rte_eth_get_monitor_addr(port_id, queue_id,
>> - &dummy) == -ENOTSUP) {
>> - RTE_LOG(DEBUG, POWER, "The device does not support
>> rte_eth_get_monitor_addr\n");
>> - ret = -ENOTSUP;
>> - goto end;
>> - }
>> clb = clb_umwait;
>> break;
>> - }
>> case RTE_POWER_MGMT_TYPE_SCALE:
>> - {
>> - enum power_management_env env;
>> - /* only PSTATE and ACPI modes are supported */
>> - if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> - !rte_power_check_env_supported(
>> - PM_ENV_PSTATE_CPUFREQ)) {
>> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are
>> supported\n");
>> - ret = -ENOTSUP;
>> + /* check if we can add a new queue */
>> + ret = check_scale(lcore_id);
>> + if (ret < 0)
>> goto end;
>> - }
>> - /* ensure we could initialize the power library */
>> - if (rte_power_init(lcore_id)) {
>> - ret = -EINVAL;
>> - goto end;
>> - }
>> - /* ensure we initialized the correct env */
>> - env = rte_power_get_env();
>> - if (env != PM_ENV_ACPI_CPUFREQ &&
>> - env != PM_ENV_PSTATE_CPUFREQ) {
>> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes
>> were initialized\n");
>> - ret = -ENOTSUP;
>> - goto end;
>> - }
>> clb = clb_scale_freq;
>> break;
>> - }
>> case RTE_POWER_MGMT_TYPE_PAUSE:
>> /* figure out various time-to-tsc conversions */
>> if (global_data.tsc_per_us == 0)
>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned
>> int lcore_id, uint16_t port_id,
>> ret = -EINVAL;
>> goto end;
>> }
>> + /* add this queue to the list */
>> + ret = queue_list_add(lcore_cfg, &qdata);
>> + if (ret < 0) {
>> + RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>> + strerror(-ret));
>> + goto end;
>> + }
>> + /* new queue is always added last */
>> + queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>
>
> Need to ensure that queue_cfg gets set here, otherwise we'll get a
> segfault below.
>
Or, looking at this again, shouldn't "lcore_cfgs" be "lcore_cfg"?
>
>
>> /* initialize data before enabling the callback */
>> - queue_cfg->empty_poll_stats = 0;
>> - queue_cfg->cb_mode = mode;
>> - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> - clb, NULL);
>> + if (lcore_cfg->n_queues == 1) {
>> + lcore_cfg->cb_mode = mode;
>> + lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> + }
>> + queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>> + clb, queue_cfg);
> --snip--
>
* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
2021-07-01 9:01 ` David Hunt
@ 2021-07-05 10:24 ` Burakov, Anatoly
0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-05 10:24 UTC (permalink / raw)
To: David Hunt, dev; +Cc: konstantin.ananyev, ciara.loftus
On 01-Jul-21 10:01 AM, David Hunt wrote:
>
> On 30/6/2021 10:52 AM, David Hunt wrote:
>> Hi Anatoly,
>>
>> On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
>>> Currently, there is a hard limitation on the PMD power management
>>> support that only allows it to support a single queue per lcore. This is
>>> not ideal as most DPDK use cases will poll multiple queues per core.
>>>
>>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>>> is very difficult to implement such support because callbacks are
>>> effectively stateless and have no visibility into what the other ethdev
>>> devices are doing. This places limitations on what we can do within the
>>> framework of Rx callbacks, but the basics of this implementation are as
>>> follows:
>>>
>>> - Replace per-queue structures with per-lcore ones, so that any device
>>> polled from the same lcore can share data
>>> - Any queue that is going to be polled from a specific lcore has to be
>>> added to the list of queues to poll, so that the callback is aware of
>>> other queues being polled by the same lcore
>>> - Both the empty poll counter and the actual power saving mechanism is
>>> shared between all queues polled on a particular lcore, and is only
>>> activated when all queues in the list were polled and were determined
>>> to have no traffic.
>>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>> is incapable of monitoring more than one address.
>>>
>>> Also, while we're at it, update and improve the docs.
>>>
>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>> ---
>>>
>>> Notes:
>>> v5:
>>> - Remove the "power save queue" API and replace it with
>>> mechanism suggested by
>>> Konstantin
>>> v3:
>>> - Move the list of supported NICs to NIC feature table
>>> v2:
>>> - Use a TAILQ for queues instead of a static array
>>> - Address feedback from Konstantin
>>> - Add additional checks for stopped queues
>>>
>>> doc/guides/nics/features.rst | 10 +
>>> doc/guides/prog_guide/power_man.rst | 65 ++--
>>> doc/guides/rel_notes/release_21_08.rst | 3 +
>>> lib/power/rte_power_pmd_mgmt.c | 431 ++++++++++++++++++-------
>>> 4 files changed, 373 insertions(+), 136 deletions(-)
>>>
>>
>> --snip--
>>
>>> int
>>> rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t
>>> port_id,
>>> uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>>> {
>>> - struct pmd_queue_cfg *queue_cfg;
>>> + const union queue qdata = {.portid = port_id, .qid = queue_id};
>>> + struct pmd_core_cfg *lcore_cfg;
>>> + struct queue_list_entry *queue_cfg;
>>> struct rte_eth_dev_info info;
>>> rte_rx_callback_fn clb;
>>> int ret;
>>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int
>>> lcore_id, uint16_t port_id,
>>> goto end;
>>> }
>>> - queue_cfg = &port_cfg[port_id][queue_id];
>>> + lcore_cfg = &lcore_cfgs[lcore_id];
>>> - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>>> + /* check if other queues are stopped as well */
>>> + ret = cfg_queues_stopped(lcore_cfg);
>>> + if (ret != 1) {
>>> + /* error means invalid queue, 0 means queue wasn't stopped */
>>> + ret = ret < 0 ? -EINVAL : -EBUSY;
>>> + goto end;
>>> + }
>>> +
>>> + /* if callback was already enabled, check current callback type */
>>> + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>>> + lcore_cfg->cb_mode != mode) {
>>> ret = -EINVAL;
>>> goto end;
>>> }
>>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned
>>> int lcore_id, uint16_t port_id,
>>> switch (mode) {
>>> case RTE_POWER_MGMT_TYPE_MONITOR:
>>> - {
>>> - struct rte_power_monitor_cond dummy;
>>> -
>>> - /* check if rte_power_monitor is supported */
>>> - if (!global_data.intrinsics_support.power_monitor) {
>>> - RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not
>>> supported\n");
>>> - ret = -ENOTSUP;
>>> + /* check if we can add a new queue */
>>> + ret = check_monitor(lcore_cfg, &qdata);
>>> + if (ret < 0)
>>> goto end;
>>> - }
>>> - /* check if the device supports the necessary PMD API */
>>> - if (rte_eth_get_monitor_addr(port_id, queue_id,
>>> - &dummy) == -ENOTSUP) {
>>> - RTE_LOG(DEBUG, POWER, "The device does not support
>>> rte_eth_get_monitor_addr\n");
>>> - ret = -ENOTSUP;
>>> - goto end;
>>> - }
>>> clb = clb_umwait;
>>> break;
>>> - }
>>> case RTE_POWER_MGMT_TYPE_SCALE:
>>> - {
>>> - enum power_management_env env;
>>> - /* only PSTATE and ACPI modes are supported */
>>> - if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>>> - !rte_power_check_env_supported(
>>> - PM_ENV_PSTATE_CPUFREQ)) {
>>> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are
>>> supported\n");
>>> - ret = -ENOTSUP;
>>> + /* check if we can add a new queue */
>>> + ret = check_scale(lcore_id);
>>> + if (ret < 0)
>>> goto end;
>>> - }
>>> - /* ensure we could initialize the power library */
>>> - if (rte_power_init(lcore_id)) {
>>> - ret = -EINVAL;
>>> - goto end;
>>> - }
>>> - /* ensure we initialized the correct env */
>>> - env = rte_power_get_env();
>>> - if (env != PM_ENV_ACPI_CPUFREQ &&
>>> - env != PM_ENV_PSTATE_CPUFREQ) {
>>> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes
>>> were initialized\n");
>>> - ret = -ENOTSUP;
>>> - goto end;
>>> - }
>>> clb = clb_scale_freq;
>>> break;
>>> - }
>>> case RTE_POWER_MGMT_TYPE_PAUSE:
>>> /* figure out various time-to-tsc conversions */
>>> if (global_data.tsc_per_us == 0)
>>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned
>>> int lcore_id, uint16_t port_id,
>>> ret = -EINVAL;
>>> goto end;
>>> }
>>> + /* add this queue to the list */
>>> + ret = queue_list_add(lcore_cfg, &qdata);
>>> + if (ret < 0) {
>>> + RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>>> + strerror(-ret));
>>> + goto end;
>>> + }
>>> + /* new queue is always added last */
>>> + queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>>
>>
>> Need to ensure that queue_cfg gets set here, otherwise we'll get a
>> segfault below.
>>
>
> Or, looking at this again, shouldn't "lcore_cfgs" be "lcore_cfg"?
Good catch, will fix!
>
>
>>
>>
>>> /* initialize data before enabling the callback */
>>> - queue_cfg->empty_poll_stats = 0;
>>> - queue_cfg->cb_mode = mode;
>>> - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>>> - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>>> - clb, NULL);
>>> + if (lcore_cfg->n_queues == 1) {
>>> + lcore_cfg->cb_mode = mode;
>>> + lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>>> + }
>>> + queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>>> + clb, queue_cfg);
>> --snip--
>>
--
Thanks,
Anatoly
* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
2021-06-30 9:52 ` David Hunt
@ 2021-06-30 11:04 ` Ananyev, Konstantin
2021-07-05 10:23 ` Burakov, Anatoly
1 sibling, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-30 11:04 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
>
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
>
> - Replace per-queue structures with per-lcore ones, so that any device
> polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
> added to the list of queues to poll, so that the callback is aware of
> other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism is
> shared between all queues polled on a particular lcore, and is only
> activated when all queues in the list were polled and were determined
> to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
> is incapable of monitoring more than one address.
>
> Also, while we're at it, update and improve the docs.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v5:
> - Remove the "power save queue" API and replace it with mechanism suggested by
> Konstantin
>
> v3:
> - Move the list of supported NICs to NIC feature table
>
> v2:
> - Use a TAILQ for queues instead of a static array
> - Address feedback from Konstantin
> - Add additional checks for stopped queues
>
> doc/guides/nics/features.rst | 10 +
> doc/guides/prog_guide/power_man.rst | 65 ++--
> doc/guides/rel_notes/release_21_08.rst | 3 +
> lib/power/rte_power_pmd_mgmt.c | 431 ++++++++++++++++++-------
> 4 files changed, 373 insertions(+), 136 deletions(-)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 403c2b03a3..a96e12d155 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
> * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
> * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
>
> +.. _nic_features_get_monitor_addr:
> +
> +PMD power management using monitor addresses
> +--------------------------------------------
> +
> +Supports getting a monitoring condition to use together with Ethernet PMD power
> +management (see :doc:`../prog_guide/power_man` for more details).
> +
> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
> +
> .. _nic_features_other:
>
> Other dev ops not represented by a Feature
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index c70ae128ac..ec04a72108 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
> Abstract
> ~~~~~~~~
>
> -Existing power management mechanisms require developers
> -to change application design or change code to make use of it.
> -The PMD power management API provides a convenient alternative
> -by utilizing Ethernet PMD RX callbacks,
> -and triggering power saving whenever empty poll count reaches a certain number.
> -
> -Monitor
> - This power saving scheme will put the CPU into optimized power state
> - and use the ``rte_power_monitor()`` function
> - to monitor the Ethernet PMD RX descriptor address,
> - and wake the CPU up whenever there's new traffic.
> -
> -Pause
> - This power saving scheme will avoid busy polling
> - by either entering power-optimized sleep state
> - with ``rte_power_pause()`` function,
> - or, if it's not available, use ``rte_pause()``.
> -
> -Frequency scaling
> - This power saving scheme will use ``librte_power`` library
> - functionality to scale the core frequency up/down
> - depending on traffic volume.
> -
> -.. note::
> -
> - Currently, this power management API is limited to mandatory mapping
> - of 1 queue to 1 core (multiple queues are supported,
> - but they must be polled from different cores).
> +Existing power management mechanisms require developers to change application
> +design or change code to make use of it. The PMD power management API provides a
> +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
> +power saving whenever empty poll count reaches a certain number.
> +
> +* Monitor
> + This power saving scheme will put the CPU into optimized power state and
> + monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
> + there's new traffic. Support for this scheme may not be available on all
> + platforms, and further limitations may apply (see below).
> +
> +* Pause
> + This power saving scheme will avoid busy polling by either entering
> + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
> + not supported by the underlying platform, use ``rte_pause()``.
> +
> +* Frequency scaling
> + This power saving scheme will use ``librte_power`` library functionality to
> + scale the core frequency up/down depending on traffic volume.
> +
> +The "monitor" mode is only supported in the following configurations and scenarios:
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that
> + ``rte_power_monitor()`` is supported by the platform, then monitoring will be
> + limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
> + monitored from a different lcore).
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
> + ``rte_power_monitor()`` function is not supported, then monitor mode will not
> + be supported.
> +
> +* Not all Ethernet drivers support monitoring, even if the underlying
> + platform may support the necessary CPU instructions. Please refer to
> + :doc:`../nics/overview` for more information.
> +
>
> API Overview for Ethernet PMD Power Management
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -242,3 +249,5 @@ References
>
> * The :doc:`../sample_app_ug/vm_power_management`
> chapter in the :doc:`../sample_app_ug/index` section.
> +
> +* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
> diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
> index f015c509fc..3926d45ef8 100644
> --- a/doc/guides/rel_notes/release_21_08.rst
> +++ b/doc/guides/rel_notes/release_21_08.rst
> @@ -57,6 +57,9 @@ New Features
>
> * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
>
> +* rte_power: The experimental PMD power management API now supports managing
> + multiple Ethernet Rx queues per lcore.
> +
>
> Removed Items
> -------------
> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> index 9b95cf1794..fccfd236c2 100644
> --- a/lib/power/rte_power_pmd_mgmt.c
> +++ b/lib/power/rte_power_pmd_mgmt.c
> @@ -33,18 +33,96 @@ enum pmd_mgmt_state {
> PMD_MGMT_ENABLED
> };
>
> -struct pmd_queue_cfg {
> +union queue {
> + uint32_t val;
> + struct {
> + uint16_t portid;
> + uint16_t qid;
> + };
> +};
> +
> +struct queue_list_entry {
> + TAILQ_ENTRY(queue_list_entry) next;
> + union queue queue;
> + uint64_t n_empty_polls;
> + const struct rte_eth_rxtx_callback *cb;
> +};
> +
> +struct pmd_core_cfg {
> + TAILQ_HEAD(queue_list_head, queue_list_entry) head;
> + /**< List of queues associated with this lcore */
> + size_t n_queues;
> + /**< How many queues are in the list? */
> volatile enum pmd_mgmt_state pwr_mgmt_state;
> /**< State of power management for this queue */
> enum rte_power_pmd_mgmt_type cb_mode;
> /**< Callback mode for this queue */
> - const struct rte_eth_rxtx_callback *cur_cb;
> - /**< Callback instance */
> - uint64_t empty_poll_stats;
> - /**< Number of empty polls */
> + uint64_t n_queues_ready_to_sleep;
> + /**< Number of queues ready to enter power optimized state */
> } __rte_cache_aligned;
> +static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
>
> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
> +static inline bool
> +queue_equal(const union queue *l, const union queue *r)
> +{
> + return l->val == r->val;
> +}
> +
> +static inline void
> +queue_copy(union queue *dst, const union queue *src)
> +{
> + dst->val = src->val;
> +}
> +
> +static struct queue_list_entry *
> +queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
> +{
> + struct queue_list_entry *cur;
> +
> + TAILQ_FOREACH(cur, &cfg->head, next) {
> + if (queue_equal(&cur->queue, q))
> + return cur;
> + }
> + return NULL;
> +}
> +
> +static int
> +queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
> +{
> + struct queue_list_entry *qle;
> +
> + /* is it already in the list? */
> + if (queue_list_find(cfg, q) != NULL)
> + return -EEXIST;
> +
> + qle = malloc(sizeof(*qle));
> + if (qle == NULL)
> + return -ENOMEM;
> + memset(qle, 0, sizeof(*qle));
> +
> + queue_copy(&qle->queue, q);
> + TAILQ_INSERT_TAIL(&cfg->head, qle, next);
> + cfg->n_queues++;
> + qle->n_empty_polls = 0;
> +
> + return 0;
> +}
> +
> +static struct queue_list_entry *
> +queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
> +{
> + struct queue_list_entry *found;
> +
> + found = queue_list_find(cfg, q);
> + if (found == NULL)
> + return NULL;
> +
> + TAILQ_REMOVE(&cfg->head, found, next);
> + cfg->n_queues--;
> +
> + /* freeing is responsibility of the caller */
> + return found;
> +}
>
> static void
> calc_tsc(void)
> @@ -74,21 +152,56 @@ calc_tsc(void)
> }
> }
>
> +static inline void
> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> + /* reset empty poll counter for this queue */
> + qcfg->n_empty_polls = 0;
> + /* reset the sleep counter too */
> + cfg->n_queues_ready_to_sleep = 0;
> +}
> +
> +static inline bool
> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> + /* this function is called - that means we have an empty poll */
> + qcfg->n_empty_polls++;
> +
> + /* if we haven't reached threshold for empty polls, we can't sleep */
> + if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
> + return false;
> +
> + /* we're ready to sleep */
> + cfg->n_queues_ready_to_sleep++;
> +
> + return true;
> +}
> +
> +static inline bool
> +lcore_can_sleep(struct pmd_core_cfg *cfg)
> +{
> + /* are all queues ready to sleep? */
> + if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
> + return false;
> +
> + /* we've reached an iteration where we can sleep, reset sleep counter */
> + cfg->n_queues_ready_to_sleep = 0;
> +
> + return true;
> +}
As I can see, it's a slightly modified version of what was discussed.
I understand that it seems simpler, but I think there are some problems with it:
- each queue can be counted more than once in lcore_cfg->n_queues_ready_to_sleep
- the queues' n_empty_polls counters are not reset after sleep().
To illustrate the problem, let's say we have 2 queues, and at some moment we have:
q0.n_empty_polls == EMPTYPOLL_MAX + 1
q1.n_empty_polls == EMPTYPOLL_MAX + 1
cfg->n_queues_ready_to_sleep == 2
So lcore_can_sleep() returns 'true' and sets:
cfg->n_queues_ready_to_sleep == 0
Now, after sleep():
q0.n_empty_polls == EMPTYPOLL_MAX + 1
q1.n_empty_polls == EMPTYPOLL_MAX + 1
So after:
queue_can_sleep(q0);
queue_can_sleep(q1);
we will have:
cfg->n_queues_ready_to_sleep == 2
again, and we'll go to another sleep after just one rx_burst() attempt for each queue.
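To make the failure mode concrete, here is a minimal, self-contained sketch of the two helpers as they appear in the patch, with simplified stand-in structures and a hypothetical driver function (sleeps_again_after_one_poll() is not part of the patch). It shows that, because n_empty_polls is never reset, the lcore is granted a second sleep after just one empty poll per queue:

```c
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512

/* simplified stand-ins for pmd_core_cfg / queue_list_entry */
struct queue_cfg { uint64_t n_empty_polls; };
struct core_cfg {
	uint64_t n_queues;
	uint64_t n_queues_ready_to_sleep;
};

bool
queue_can_sleep(struct core_cfg *cfg, struct queue_cfg *q)
{
	q->n_empty_polls++;
	if (q->n_empty_polls <= EMPTYPOLL_MAX)
		return false;
	/* note: nothing prevents the same queue being counted twice */
	cfg->n_queues_ready_to_sleep++;
	return true;
}

bool
lcore_can_sleep(struct core_cfg *cfg)
{
	if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
		return false;
	cfg->n_queues_ready_to_sleep = 0;
	return true;
}

/* hypothetical driver: returns true if a second sleep is granted after
 * just one empty poll per queue following the first sleep */
bool
sleeps_again_after_one_poll(void)
{
	struct core_cfg cfg = { .n_queues = 2, .n_queues_ready_to_sleep = 0 };
	struct queue_cfg q0 = { EMPTYPOLL_MAX }, q1 = { EMPTYPOLL_MAX };

	/* one empty poll per queue pushes both past the threshold */
	queue_can_sleep(&cfg, &q0);
	queue_can_sleep(&cfg, &q1);
	if (!lcore_can_sleep(&cfg))
		return false;	/* first sleep - expected */

	/* n_empty_polls are never reset, so a single additional empty
	 * poll per queue makes the lcore "ready to sleep" all over again */
	queue_can_sleep(&cfg, &q0);
	queue_can_sleep(&cfg, &q1);
	return lcore_can_sleep(&cfg);
}
```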
> +
> static uint16_t
> clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> - uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> - void *addr __rte_unused)
> + uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
> {
> + struct queue_list_entry *queue_conf = arg;
>
> - struct pmd_queue_cfg *q_conf;
> -
> - q_conf = &port_cfg[port_id][qidx];
> -
> + /* this callback can't do more than one queue, omit multiqueue logic */
> if (unlikely(nb_rx == 0)) {
> - q_conf->empty_poll_stats++;
> - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> + queue_conf->n_empty_polls++;
> + if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
> struct rte_power_monitor_cond pmc;
> - uint16_t ret;
> + int ret;
>
> /* use monitoring condition to sleep */
> ret = rte_eth_get_monitor_addr(port_id, qidx,
> @@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> rte_power_monitor(&pmc, UINT64_MAX);
> }
> } else
> - q_conf->empty_poll_stats = 0;
> + queue_conf->n_empty_polls = 0;
>
> return nb_rx;
> }
>
> static uint16_t
> -clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> - uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> - void *addr __rte_unused)
> +clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
> + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> + uint16_t max_pkts __rte_unused, void *arg)
> {
> - struct pmd_queue_cfg *q_conf;
> + const unsigned int lcore = rte_lcore_id();
> + struct queue_list_entry *queue_conf = arg;
> + struct pmd_core_cfg *lcore_conf;
> + const bool empty = nb_rx == 0;
>
> - q_conf = &port_cfg[port_id][qidx];
> + lcore_conf = &lcore_cfgs[lcore];
>
> - if (unlikely(nb_rx == 0)) {
> - q_conf->empty_poll_stats++;
> - /* sleep for 1 microsecond */
> - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> - /* use tpause if we have it */
> - if (global_data.intrinsics_support.power_pause) {
> - const uint64_t cur = rte_rdtsc();
> - const uint64_t wait_tsc =
> - cur + global_data.tsc_per_us;
> - rte_power_pause(wait_tsc);
> - } else {
> - uint64_t i;
> - for (i = 0; i < global_data.pause_per_us; i++)
> - rte_pause();
> - }
> + if (likely(!empty))
> + /* early exit */
> + queue_reset(lcore_conf, queue_conf);
> + else {
> + /* can this queue sleep? */
> + if (!queue_can_sleep(lcore_conf, queue_conf))
> + return nb_rx;
> +
> + /* can this lcore sleep? */
> + if (!lcore_can_sleep(lcore_conf))
> + return nb_rx;
> +
> + /* sleep for 1 microsecond, use tpause if we have it */
> + if (global_data.intrinsics_support.power_pause) {
> + const uint64_t cur = rte_rdtsc();
> + const uint64_t wait_tsc =
> + cur + global_data.tsc_per_us;
> + rte_power_pause(wait_tsc);
> + } else {
> + uint64_t i;
> + for (i = 0; i < global_data.pause_per_us; i++)
> + rte_pause();
> }
> - } else
> - q_conf->empty_poll_stats = 0;
> + }
>
> return nb_rx;
> }
>
> static uint16_t
> -clb_scale_freq(uint16_t port_id, uint16_t qidx,
> +clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
> struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> - uint16_t max_pkts __rte_unused, void *_ __rte_unused)
> + uint16_t max_pkts __rte_unused, void *arg)
> {
> - struct pmd_queue_cfg *q_conf;
> + const unsigned int lcore = rte_lcore_id();
> + const bool empty = nb_rx == 0;
> + struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
> + struct queue_list_entry *queue_conf = arg;
>
> - q_conf = &port_cfg[port_id][qidx];
> + if (likely(!empty)) {
> + /* early exit */
> + queue_reset(lcore_conf, queue_conf);
>
> - if (unlikely(nb_rx == 0)) {
> - q_conf->empty_poll_stats++;
> - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
> - /* scale down freq */
> - rte_power_freq_min(rte_lcore_id());
> - } else {
> - q_conf->empty_poll_stats = 0;
> - /* scale up freq */
> + /* scale up freq immediately */
> rte_power_freq_max(rte_lcore_id());
> + } else {
> + /* can this queue sleep? */
> + if (!queue_can_sleep(lcore_conf, queue_conf))
> + return nb_rx;
> +
> + /* can this lcore sleep? */
> + if (!lcore_can_sleep(lcore_conf))
> + return nb_rx;
> +
> + rte_power_freq_min(rte_lcore_id());
> }
>
> return nb_rx;
> @@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
> return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
> }
>
> +static int
> +cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
> +{
> + const struct queue_list_entry *entry;
> +
> + TAILQ_FOREACH(entry, &queue_cfg->head, next) {
> + const union queue *q = &entry->queue;
> + int ret = queue_stopped(q->portid, q->qid);
> + if (ret != 1)
> + return ret;
> + }
> + return 1;
> +}
> +
> +static int
> +check_scale(unsigned int lcore)
> +{
> + enum power_management_env env;
> +
> + /* only PSTATE and ACPI modes are supported */
> + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> + !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
> + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> + return -ENOTSUP;
> + }
> + /* ensure we could initialize the power library */
> + if (rte_power_init(lcore))
> + return -EINVAL;
> +
> + /* ensure we initialized the correct env */
> + env = rte_power_get_env();
> + if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
> + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> + return -ENOTSUP;
> + }
> +
> + /* we're done */
> + return 0;
> +}
> +
> +static int
> +check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
> +{
> + struct rte_power_monitor_cond dummy;
> +
> + /* check if rte_power_monitor is supported */
> + if (!global_data.intrinsics_support.power_monitor) {
> + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> + return -ENOTSUP;
> + }
> +
> + if (cfg->n_queues > 0) {
> + RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
> + return -ENOTSUP;
> + }
> +
> + /* check if the device supports the necessary PMD API */
> + if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
> + &dummy) == -ENOTSUP) {
> + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> + return -ENOTSUP;
> + }
> +
> + /* we're done */
> + return 0;
> +}
> +
> int
> rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
> {
> - struct pmd_queue_cfg *queue_cfg;
> + const union queue qdata = {.portid = port_id, .qid = queue_id};
> + struct pmd_core_cfg *lcore_cfg;
> + struct queue_list_entry *queue_cfg;
> struct rte_eth_dev_info info;
> rte_rx_callback_fn clb;
> int ret;
> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> goto end;
> }
>
> - queue_cfg = &port_cfg[port_id][queue_id];
> + lcore_cfg = &lcore_cfgs[lcore_id];
>
> - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
> + /* check if other queues are stopped as well */
> + ret = cfg_queues_stopped(lcore_cfg);
> + if (ret != 1) {
> + /* error means invalid queue, 0 means queue wasn't stopped */
> + ret = ret < 0 ? -EINVAL : -EBUSY;
> + goto end;
> + }
> +
> + /* if callback was already enabled, check current callback type */
> + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
> + lcore_cfg->cb_mode != mode) {
> ret = -EINVAL;
> goto end;
> }
> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>
> switch (mode) {
> case RTE_POWER_MGMT_TYPE_MONITOR:
> - {
> - struct rte_power_monitor_cond dummy;
> -
> - /* check if rte_power_monitor is supported */
> - if (!global_data.intrinsics_support.power_monitor) {
> - RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> - ret = -ENOTSUP;
> + /* check if we can add a new queue */
> + ret = check_monitor(lcore_cfg, &qdata);
> + if (ret < 0)
> goto end;
> - }
>
> - /* check if the device supports the necessary PMD API */
> - if (rte_eth_get_monitor_addr(port_id, queue_id,
> - &dummy) == -ENOTSUP) {
> - RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> - ret = -ENOTSUP;
> - goto end;
> - }
> clb = clb_umwait;
> break;
> - }
> case RTE_POWER_MGMT_TYPE_SCALE:
> - {
> - enum power_management_env env;
> - /* only PSTATE and ACPI modes are supported */
> - if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> - !rte_power_check_env_supported(
> - PM_ENV_PSTATE_CPUFREQ)) {
> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> - ret = -ENOTSUP;
> + /* check if we can add a new queue */
> + ret = check_scale(lcore_id);
> + if (ret < 0)
> goto end;
> - }
> - /* ensure we could initialize the power library */
> - if (rte_power_init(lcore_id)) {
> - ret = -EINVAL;
> - goto end;
> - }
> - /* ensure we initialized the correct env */
> - env = rte_power_get_env();
> - if (env != PM_ENV_ACPI_CPUFREQ &&
> - env != PM_ENV_PSTATE_CPUFREQ) {
> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> - ret = -ENOTSUP;
> - goto end;
> - }
> clb = clb_scale_freq;
> break;
> - }
> case RTE_POWER_MGMT_TYPE_PAUSE:
> /* figure out various time-to-tsc conversions */
> if (global_data.tsc_per_us == 0)
> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> ret = -EINVAL;
> goto end;
> }
> + /* add this queue to the list */
> + ret = queue_list_add(lcore_cfg, &qdata);
> + if (ret < 0) {
> + RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
> + strerror(-ret));
> + goto end;
> + }
> + /* new queue is always added last */
> + queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>
> /* initialize data before enabling the callback */
> - queue_cfg->empty_poll_stats = 0;
> - queue_cfg->cb_mode = mode;
> - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> - clb, NULL);
> + if (lcore_cfg->n_queues == 1) {
> + lcore_cfg->cb_mode = mode;
> + lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> + }
> + queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
> + clb, queue_cfg);
>
> ret = 0;
> end:
> @@ -290,7 +476,9 @@ int
> rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
> uint16_t port_id, uint16_t queue_id)
> {
> - struct pmd_queue_cfg *queue_cfg;
> + const union queue qdata = {.portid = port_id, .qid = queue_id};
> + struct pmd_core_cfg *lcore_cfg;
> + struct queue_list_entry *queue_cfg;
> int ret;
>
> RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> @@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
> }
>
> /* no need to check queue id as wrong queue id would not be enabled */
> - queue_cfg = &port_cfg[port_id][queue_id];
> + lcore_cfg = &lcore_cfgs[lcore_id];
>
> - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
> + /* check if other queues are stopped as well */
> + ret = cfg_queues_stopped(lcore_cfg);
> + if (ret != 1) {
> + /* error means invalid queue, 0 means queue wasn't stopped */
> + return ret < 0 ? -EINVAL : -EBUSY;
> + }
> +
> + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
> return -EINVAL;
>
> - /* stop any callbacks from progressing */
> - queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> + /*
> + * There is no good/easy way to do this without race conditions, so we
> + * are just going to throw our hands in the air and hope that the user
> + * has read the documentation and has ensured that ports are stopped at
> + * the time we enter the API functions.
> + */
> + queue_cfg = queue_list_take(lcore_cfg, &qdata);
> + if (queue_cfg == NULL)
> + return -ENOENT;
>
> - switch (queue_cfg->cb_mode) {
> + /* if we've removed all queues from the lists, set state to disabled */
> + if (lcore_cfg->n_queues == 0)
> + lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +
> + switch (lcore_cfg->cb_mode) {
> case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
> case RTE_POWER_MGMT_TYPE_PAUSE:
> - rte_eth_remove_rx_callback(port_id, queue_id,
> - queue_cfg->cur_cb);
> + rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
> break;
> case RTE_POWER_MGMT_TYPE_SCALE:
> rte_power_freq_max(lcore_id);
> - rte_eth_remove_rx_callback(port_id, queue_id,
> - queue_cfg->cur_cb);
> + rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
> rte_power_exit(lcore_id);
> break;
> }
> @@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
> * ports before calling any of these API's, so we can assume that the
> * callbacks can be freed. we're intentionally casting away const-ness.
> */
> - rte_free((void *)queue_cfg->cur_cb);
> + rte_free((void *)queue_cfg->cb);
> + free(queue_cfg);
>
> return 0;
> }
> +
> +RTE_INIT(rte_power_ethdev_pmgmt_init) {
> + size_t i;
> +
> + /* initialize all tailqs */
> + for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
> + struct pmd_core_cfg *cfg = &lcore_cfgs[i];
> + TAILQ_INIT(&cfg->head);
> + }
> +}
> --
> 2.25.1
* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
2021-06-30 11:04 ` Ananyev, Konstantin
@ 2021-07-05 10:23 ` Burakov, Anatoly
0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-05 10:23 UTC (permalink / raw)
To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara
On 30-Jun-21 12:04 PM, Ananyev, Konstantin wrote:
>
>
>
>> Currently, there is a hard limitation on the PMD power management
>> support that only allows it to support a single queue per lcore. This is
>> not ideal as most DPDK use cases will poll multiple queues per core.
>>
>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>> is very difficult to implement such support because callbacks are
>> effectively stateless and have no visibility into what the other ethdev
>> devices are doing. This places limitations on what we can do within the
>> framework of Rx callbacks, but the basics of this implementation are as
>> follows:
>>
>> - Replace per-queue structures with per-lcore ones, so that any device
>> polled from the same lcore can share data
>> - Any queue that is going to be polled from a specific lcore has to be
>> added to the list of queues to poll, so that the callback is aware of
>> other queues being polled by the same lcore
>> - Both the empty poll counter and the actual power saving mechanism is
>> shared between all queues polled on a particular lcore, and is only
>> activated when all queues in the list were polled and were determined
>> to have no traffic.
>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>> is incapable of monitoring more than one address.
>>
>> Also, while we're at it, update and improve the docs.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>> v5:
>> - Remove the "power save queue" API and replace it with mechanism suggested by
>> Konstantin
>>
>> v3:
>> - Move the list of supported NICs to NIC feature table
>>
>> v2:
>> - Use a TAILQ for queues instead of a static array
>> - Address feedback from Konstantin
>> - Add additional checks for stopped queues
>>
>> doc/guides/nics/features.rst | 10 +
>> doc/guides/prog_guide/power_man.rst | 65 ++--
>> doc/guides/rel_notes/release_21_08.rst | 3 +
>> lib/power/rte_power_pmd_mgmt.c | 431 ++++++++++++++++++-------
>> 4 files changed, 373 insertions(+), 136 deletions(-)
>>
>> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
>> index 403c2b03a3..a96e12d155 100644
>> --- a/doc/guides/nics/features.rst
>> +++ b/doc/guides/nics/features.rst
>> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
>> * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
>> * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
>>
>> +.. _nic_features_get_monitor_addr:
>> +
>> +PMD power management using monitor addresses
>> +--------------------------------------------
>> +
>> +Supports getting a monitoring condition to use together with Ethernet PMD power
>> +management (see :doc:`../prog_guide/power_man` for more details).
>> +
>> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
>> +
>> .. _nic_features_other:
>>
>> Other dev ops not represented by a Feature
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index c70ae128ac..ec04a72108 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
>> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
>> Abstract
>> ~~~~~~~~
>>
>> -Existing power management mechanisms require developers
>> -to change application design or change code to make use of it.
>> -The PMD power management API provides a convenient alternative
>> -by utilizing Ethernet PMD RX callbacks,
>> -and triggering power saving whenever empty poll count reaches a certain number.
>> -
>> -Monitor
>> - This power saving scheme will put the CPU into optimized power state
>> - and use the ``rte_power_monitor()`` function
>> - to monitor the Ethernet PMD RX descriptor address,
>> - and wake the CPU up whenever there's new traffic.
>> -
>> -Pause
>> - This power saving scheme will avoid busy polling
>> - by either entering power-optimized sleep state
>> - with ``rte_power_pause()`` function,
>> - or, if it's not available, use ``rte_pause()``.
>> -
>> -Frequency scaling
>> - This power saving scheme will use ``librte_power`` library
>> - functionality to scale the core frequency up/down
>> - depending on traffic volume.
>> -
>> -.. note::
>> -
>> - Currently, this power management API is limited to mandatory mapping
>> - of 1 queue to 1 core (multiple queues are supported,
>> - but they must be polled from different cores).
>> +Existing power management mechanisms require developers to change application
>> +design or change code to make use of it. The PMD power management API provides a
>> +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
>> +power saving whenever empty poll count reaches a certain number.
>> +
>> +* Monitor
>> + This power saving scheme will put the CPU into optimized power state and
>> + monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
>> + there's new traffic. Support for this scheme may not be available on all
>> + platforms, and further limitations may apply (see below).
>> +
>> +* Pause
>> + This power saving scheme will avoid busy polling by either entering
>> + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
>> + not supported by the underlying platform, use ``rte_pause()``.
>> +
>> +* Frequency scaling
>> + This power saving scheme will use ``librte_power`` library functionality to
>> + scale the core frequency up/down depending on traffic volume.
>> +
>> +The "monitor" mode is only supported in the following configurations and scenarios:
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that
>> + ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>> + limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>> + monitored from a different lcore).
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>> + ``rte_power_monitor()`` function is not supported, then monitor mode will not
>> + be supported.
>> +
>> +* Not all Ethernet drivers support monitoring, even if the underlying
>> + platform may support the necessary CPU instructions. Please refer to
>> + :doc:`../nics/overview` for more information.
>> +
>>
>> API Overview for Ethernet PMD Power Management
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> @@ -242,3 +249,5 @@ References
>>
>> * The :doc:`../sample_app_ug/vm_power_management`
>> chapter in the :doc:`../sample_app_ug/index` section.
>> +
>> +* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
>> diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
>> index f015c509fc..3926d45ef8 100644
>> --- a/doc/guides/rel_notes/release_21_08.rst
>> +++ b/doc/guides/rel_notes/release_21_08.rst
>> @@ -57,6 +57,9 @@ New Features
>>
>> * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
>>
>> +* rte_power: The experimental PMD power management API now supports managing
>> + multiple Ethernet Rx queues per lcore.
>> +
>>
>> Removed Items
>> -------------
>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>> index 9b95cf1794..fccfd236c2 100644
>> --- a/lib/power/rte_power_pmd_mgmt.c
>> +++ b/lib/power/rte_power_pmd_mgmt.c
>> @@ -33,18 +33,96 @@ enum pmd_mgmt_state {
>> PMD_MGMT_ENABLED
>> };
>>
>> -struct pmd_queue_cfg {
>> +union queue {
>> + uint32_t val;
>> + struct {
>> + uint16_t portid;
>> + uint16_t qid;
>> + };
>> +};
>> +
>> +struct queue_list_entry {
>> + TAILQ_ENTRY(queue_list_entry) next;
>> + union queue queue;
>> + uint64_t n_empty_polls;
>> + const struct rte_eth_rxtx_callback *cb;
>> +};
>> +
>> +struct pmd_core_cfg {
>> + TAILQ_HEAD(queue_list_head, queue_list_entry) head;
>> + /**< List of queues associated with this lcore */
>> + size_t n_queues;
>> + /**< How many queues are in the list? */
>> volatile enum pmd_mgmt_state pwr_mgmt_state;
>> /**< State of power management for this queue */
>> enum rte_power_pmd_mgmt_type cb_mode;
>> /**< Callback mode for this queue */
>> - const struct rte_eth_rxtx_callback *cur_cb;
>> - /**< Callback instance */
>> - uint64_t empty_poll_stats;
>> - /**< Number of empty polls */
>> + uint64_t n_queues_ready_to_sleep;
>> + /**< Number of queues ready to enter power optimized state */
>> } __rte_cache_aligned;
>> +static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
>>
>> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
>> +static inline bool
>> +queue_equal(const union queue *l, const union queue *r)
>> +{
>> + return l->val == r->val;
>> +}
>> +
>> +static inline void
>> +queue_copy(union queue *dst, const union queue *src)
>> +{
>> + dst->val = src->val;
>> +}
>> +
>> +static struct queue_list_entry *
>> +queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> + struct queue_list_entry *cur;
>> +
>> + TAILQ_FOREACH(cur, &cfg->head, next) {
>> + if (queue_equal(&cur->queue, q))
>> + return cur;
>> + }
>> + return NULL;
>> +}
>> +
>> +static int
>> +queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> + struct queue_list_entry *qle;
>> +
>> + /* is it already in the list? */
>> + if (queue_list_find(cfg, q) != NULL)
>> + return -EEXIST;
>> +
>> + qle = malloc(sizeof(*qle));
>> + if (qle == NULL)
>> + return -ENOMEM;
>> + memset(qle, 0, sizeof(*qle));
>> +
>> + queue_copy(&qle->queue, q);
>> + TAILQ_INSERT_TAIL(&cfg->head, qle, next);
>> + cfg->n_queues++;
>> + qle->n_empty_polls = 0;
>> +
>> + return 0;
>> +}
>> +
>> +static struct queue_list_entry *
>> +queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> + struct queue_list_entry *found;
>> +
>> + found = queue_list_find(cfg, q);
>> + if (found == NULL)
>> + return NULL;
>> +
>> + TAILQ_REMOVE(&cfg->head, found, next);
>> + cfg->n_queues--;
>> +
>> + /* freeing is responsibility of the caller */
>> + return found;
>> +}
>>
>> static void
>> calc_tsc(void)
>> @@ -74,21 +152,56 @@ calc_tsc(void)
>> }
>> }
>>
>> +static inline void
>> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> + /* reset empty poll counter for this queue */
>> + qcfg->n_empty_polls = 0;
>> + /* reset the sleep counter too */
>> + cfg->n_queues_ready_to_sleep = 0;
>> +}
>> +
>> +static inline bool
>> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> + /* this function is called - that means we have an empty poll */
>> + qcfg->n_empty_polls++;
>> +
>> + /* if we haven't reached threshold for empty polls, we can't sleep */
>> + if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
>> + return false;
>> +
>> + /* we're ready to sleep */
>> + cfg->n_queues_ready_to_sleep++;
>> +
>> + return true;
>> +}
>> +
>> +static inline bool
>> +lcore_can_sleep(struct pmd_core_cfg *cfg)
>> +{
>> + /* are all queues ready to sleep? */
>> + if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
>> + return false;
>> +
>> + /* we've reached an iteration where we can sleep, reset sleep counter */
>> + cfg->n_queues_ready_to_sleep = 0;
>> +
>> + return true;
>> +}
>
> As I can see, it's a slightly modified version of what was discussed.
> I understand that it seems simpler, but I think there are some problems with it:
> - each queue can be counted more than once in lcore_cfg->n_queues_ready_to_sleep
> - the queues' n_empty_polls counters are not reset after sleep().
>
The latter is intentional: we *want* to keep sleeping once we pass the
empty poll threshold.
The former shouldn't be a big problem in the conventional case, as I
don't think there are situations where people would poll core-pinned
queues in different orders; but you're right, this is a potential issue
and should be fixed. I'll add back the n_sleeps counter in the next
iteration.
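One possible shape of that fix — purely a sketch with hypothetical names, not the code from any posted revision — is to treat each lcore sleep as an "epoch" (sleep_target) and record, per queue, the epoch it last voted in (n_sleeps). Empty-poll counters stay intact across sleeps, as intended, while each queue is counted at most once per sleep decision:

```c
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512

/* hypothetical simplified structures; sleep_target starts at 1 so that
 * freshly added queues (n_sleeps == 0) have not yet voted */
struct queue_cfg {
	uint64_t n_empty_polls;
	uint64_t n_sleeps;	/* epoch this queue last voted in */
};
struct core_cfg {
	uint64_t n_queues;
	uint64_t n_queues_ready_to_sleep;
	uint64_t sleep_target;	/* current sleep epoch */
};

void
queue_reset(struct core_cfg *cfg, struct queue_cfg *q)
{
	/* traffic seen: restart this queue and open a new epoch, so
	 * every queue must vote again before the next sleep */
	q->n_empty_polls = 0;
	cfg->n_queues_ready_to_sleep = 0;
	cfg->sleep_target++;
}

bool
queue_can_sleep(struct core_cfg *cfg, struct queue_cfg *q)
{
	q->n_empty_polls++;
	if (q->n_empty_polls <= EMPTYPOLL_MAX)
		return false;
	/* already voted in this epoch - don't count the queue twice */
	if (q->n_sleeps == cfg->sleep_target)
		return true;
	q->n_sleeps = cfg->sleep_target;
	cfg->n_queues_ready_to_sleep++;
	return true;
}

bool
lcore_can_sleep(struct core_cfg *cfg)
{
	if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
		return false;
	/* sleep granted: open a new epoch for the next decision; the
	 * per-queue n_empty_polls are deliberately left untouched */
	cfg->n_queues_ready_to_sleep = 0;
	cfg->sleep_target++;
	return true;
}
```

With this, polling the same queue twice in a row only counts it once toward n_queues_ready_to_sleep, while an idle lcore still re-enters sleep after one empty poll per queue, which is the intended behaviour described above.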
> To illustrate the problem, let's say we have 2 queues, and at some moment we have:
> q0.n_empty_polls == EMPTYPOLL_MAX + 1
> q1.n_empty_polls == EMPTYPOLL_MAX + 1
> cfg->n_queues_ready_to_sleep == 2
>
> So lcore_can_sleep() returns 'true' and sets:
> cfg->n_queues_ready_to_sleep == 0
>
> Now, after sleep():
> q0.n_empty_polls == EMPTYPOLL_MAX + 1
> q1.n_empty_polls == EMPTYPOLL_MAX + 1
>
> So after:
> queue_can_sleep(q0);
> queue_can_sleep(q1);
>
> we will have:
> cfg->n_queues_ready_to_sleep == 2
> again, and we'll go to another sleep after just one rx_burst() attempt for each queue.
>
>> +
>> static uint16_t
>> clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>> - uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>> - void *addr __rte_unused)
>> + uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
>> {
>> + struct queue_list_entry *queue_conf = arg;
>>
>> - struct pmd_queue_cfg *q_conf;
>> -
>> - q_conf = &port_cfg[port_id][qidx];
>> -
>> + /* this callback can't do more than one queue, omit multiqueue logic */
>> if (unlikely(nb_rx == 0)) {
>> - q_conf->empty_poll_stats++;
>> - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> + queue_conf->n_empty_polls++;
>> + if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
>> struct rte_power_monitor_cond pmc;
>> - uint16_t ret;
>> + int ret;
>>
>> /* use monitoring condition to sleep */
>> ret = rte_eth_get_monitor_addr(port_id, qidx,
>> @@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>> rte_power_monitor(&pmc, UINT64_MAX);
>> }
>> } else
>> - q_conf->empty_poll_stats = 0;
>> + queue_conf->n_empty_polls = 0;
>>
>> return nb_rx;
>> }
>>
>> static uint16_t
>> -clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>> - uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>> - void *addr __rte_unused)
>> +clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>> + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> + uint16_t max_pkts __rte_unused, void *arg)
>> {
>> - struct pmd_queue_cfg *q_conf;
>> + const unsigned int lcore = rte_lcore_id();
>> + struct queue_list_entry *queue_conf = arg;
>> + struct pmd_core_cfg *lcore_conf;
>> + const bool empty = nb_rx == 0;
>>
>> - q_conf = &port_cfg[port_id][qidx];
>> + lcore_conf = &lcore_cfgs[lcore];
>>
>> - if (unlikely(nb_rx == 0)) {
>> - q_conf->empty_poll_stats++;
>> - /* sleep for 1 microsecond */
>> - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> - /* use tpause if we have it */
>> - if (global_data.intrinsics_support.power_pause) {
>> - const uint64_t cur = rte_rdtsc();
>> - const uint64_t wait_tsc =
>> - cur + global_data.tsc_per_us;
>> - rte_power_pause(wait_tsc);
>> - } else {
>> - uint64_t i;
>> - for (i = 0; i < global_data.pause_per_us; i++)
>> - rte_pause();
>> - }
>> + if (likely(!empty))
>> + /* early exit */
>> + queue_reset(lcore_conf, queue_conf);
>> + else {
>> + /* can this queue sleep? */
>> + if (!queue_can_sleep(lcore_conf, queue_conf))
>> + return nb_rx;
>> +
>> + /* can this lcore sleep? */
>> + if (!lcore_can_sleep(lcore_conf))
>> + return nb_rx;
>> +
>> + /* sleep for 1 microsecond, use tpause if we have it */
>> + if (global_data.intrinsics_support.power_pause) {
>> + const uint64_t cur = rte_rdtsc();
>> + const uint64_t wait_tsc =
>> + cur + global_data.tsc_per_us;
>> + rte_power_pause(wait_tsc);
>> + } else {
>> + uint64_t i;
>> + for (i = 0; i < global_data.pause_per_us; i++)
>> + rte_pause();
>> }
>> - } else
>> - q_conf->empty_poll_stats = 0;
>> + }
>>
>> return nb_rx;
>> }
>>
>> static uint16_t
>> -clb_scale_freq(uint16_t port_id, uint16_t qidx,
>> +clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>> struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> - uint16_t max_pkts __rte_unused, void *_ __rte_unused)
>> + uint16_t max_pkts __rte_unused, void *arg)
>> {
>> - struct pmd_queue_cfg *q_conf;
>> + const unsigned int lcore = rte_lcore_id();
>> + const bool empty = nb_rx == 0;
>> + struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
>> + struct queue_list_entry *queue_conf = arg;
>>
>> - q_conf = &port_cfg[port_id][qidx];
>> + if (likely(!empty)) {
>> + /* early exit */
>> + queue_reset(lcore_conf, queue_conf);
>>
>> - if (unlikely(nb_rx == 0)) {
>> - q_conf->empty_poll_stats++;
>> - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
>> - /* scale down freq */
>> - rte_power_freq_min(rte_lcore_id());
>> - } else {
>> - q_conf->empty_poll_stats = 0;
>> - /* scale up freq */
>> + /* scale up freq immediately */
>> rte_power_freq_max(rte_lcore_id());
>> + } else {
>> + /* can this queue sleep? */
>> + if (!queue_can_sleep(lcore_conf, queue_conf))
>> + return nb_rx;
>> +
>> + /* can this lcore sleep? */
>> + if (!lcore_can_sleep(lcore_conf))
>> + return nb_rx;
>> +
>> + rte_power_freq_min(rte_lcore_id());
>> }
>>
>> return nb_rx;
>> @@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
>> return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
>> }
>>
>> +static int
>> +cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
>> +{
>> + const struct queue_list_entry *entry;
>> +
>> + TAILQ_FOREACH(entry, &queue_cfg->head, next) {
>> + const union queue *q = &entry->queue;
>> + int ret = queue_stopped(q->portid, q->qid);
>> + if (ret != 1)
>> + return ret;
>> + }
>> + return 1;
>> +}
>> +
>> +static int
>> +check_scale(unsigned int lcore)
>> +{
>> + enum power_management_env env;
>> +
>> + /* only PSTATE and ACPI modes are supported */
>> + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> + !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
>> + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
>> + return -ENOTSUP;
>> + }
>> + /* ensure we could initialize the power library */
>> + if (rte_power_init(lcore))
>> + return -EINVAL;
>> +
>> + /* ensure we initialized the correct env */
>> + env = rte_power_get_env();
>> + if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
>> + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
>> + return -ENOTSUP;
>> + }
>> +
>> + /* we're done */
>> + return 0;
>> +}
>> +
>> +static int
>> +check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>> +{
>> + struct rte_power_monitor_cond dummy;
>> +
>> + /* check if rte_power_monitor is supported */
>> + if (!global_data.intrinsics_support.power_monitor) {
>> + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> + return -ENOTSUP;
>> + }
>> +
>> + if (cfg->n_queues > 0) {
>> + RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
>> + return -ENOTSUP;
>> + }
>> +
>> + /* check if the device supports the necessary PMD API */
>> + if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
>> + &dummy) == -ENOTSUP) {
>> + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
>> + return -ENOTSUP;
>> + }
>> +
>> + /* we're done */
>> + return 0;
>> +}
>> +
>> int
>> rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>> uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>> {
>> - struct pmd_queue_cfg *queue_cfg;
>> + const union queue qdata = {.portid = port_id, .qid = queue_id};
>> + struct pmd_core_cfg *lcore_cfg;
>> + struct queue_list_entry *queue_cfg;
>> struct rte_eth_dev_info info;
>> rte_rx_callback_fn clb;
>> int ret;
>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>> goto end;
>> }
>>
>> - queue_cfg = &port_cfg[port_id][queue_id];
>> + lcore_cfg = &lcore_cfgs[lcore_id];
>>
>> - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>> + /* check if other queues are stopped as well */
>> + ret = cfg_queues_stopped(lcore_cfg);
>> + if (ret != 1) {
>> + /* error means invalid queue, 0 means queue wasn't stopped */
>> + ret = ret < 0 ? -EINVAL : -EBUSY;
>> + goto end;
>> + }
>> +
>> + /* if callback was already enabled, check current callback type */
>> + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>> + lcore_cfg->cb_mode != mode) {
>> ret = -EINVAL;
>> goto end;
>> }
>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>
>> switch (mode) {
>> case RTE_POWER_MGMT_TYPE_MONITOR:
>> - {
>> - struct rte_power_monitor_cond dummy;
>> -
>> - /* check if rte_power_monitor is supported */
>> - if (!global_data.intrinsics_support.power_monitor) {
>> - RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> - ret = -ENOTSUP;
>> + /* check if we can add a new queue */
>> + ret = check_monitor(lcore_cfg, &qdata);
>> + if (ret < 0)
>> goto end;
>> - }
>>
>> - /* check if the device supports the necessary PMD API */
>> - if (rte_eth_get_monitor_addr(port_id, queue_id,
>> - &dummy) == -ENOTSUP) {
>> - RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
>> - ret = -ENOTSUP;
>> - goto end;
>> - }
>> clb = clb_umwait;
>> break;
>> - }
>> case RTE_POWER_MGMT_TYPE_SCALE:
>> - {
>> - enum power_management_env env;
>> - /* only PSTATE and ACPI modes are supported */
>> - if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> - !rte_power_check_env_supported(
>> - PM_ENV_PSTATE_CPUFREQ)) {
>> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
>> - ret = -ENOTSUP;
>> + /* check if we can add a new queue */
>> + ret = check_scale(lcore_id);
>> + if (ret < 0)
>> goto end;
>> - }
>> - /* ensure we could initialize the power library */
>> - if (rte_power_init(lcore_id)) {
>> - ret = -EINVAL;
>> - goto end;
>> - }
>> - /* ensure we initialized the correct env */
>> - env = rte_power_get_env();
>> - if (env != PM_ENV_ACPI_CPUFREQ &&
>> - env != PM_ENV_PSTATE_CPUFREQ) {
>> - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
>> - ret = -ENOTSUP;
>> - goto end;
>> - }
>> clb = clb_scale_freq;
>> break;
>> - }
>> case RTE_POWER_MGMT_TYPE_PAUSE:
>> /* figure out various time-to-tsc conversions */
>> if (global_data.tsc_per_us == 0)
>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>> ret = -EINVAL;
>> goto end;
>> }
>> + /* add this queue to the list */
>> + ret = queue_list_add(lcore_cfg, &qdata);
>> + if (ret < 0) {
>> + RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>> + strerror(-ret));
>> + goto end;
>> + }
>> + /* new queue is always added last */
>> + queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>>
>> /* initialize data before enabling the callback */
>> - queue_cfg->empty_poll_stats = 0;
>> - queue_cfg->cb_mode = mode;
>> - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> - clb, NULL);
>> + if (lcore_cfg->n_queues == 1) {
>> + lcore_cfg->cb_mode = mode;
>> + lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> + }
>> + queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>> + clb, queue_cfg);
>>
>> ret = 0;
>> end:
>> @@ -290,7 +476,9 @@ int
>> rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>> uint16_t port_id, uint16_t queue_id)
>> {
>> - struct pmd_queue_cfg *queue_cfg;
>> + const union queue qdata = {.portid = port_id, .qid = queue_id};
>> + struct pmd_core_cfg *lcore_cfg;
>> + struct queue_list_entry *queue_cfg;
>> int ret;
>>
>> RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>> @@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>> }
>>
>> /* no need to check queue id as wrong queue id would not be enabled */
>> - queue_cfg = &port_cfg[port_id][queue_id];
>> + lcore_cfg = &lcore_cfgs[lcore_id];
>>
>> - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>> + /* check if other queues are stopped as well */
>> + ret = cfg_queues_stopped(lcore_cfg);
>> + if (ret != 1) {
>> + /* error means invalid queue, 0 means queue wasn't stopped */
>> + return ret < 0 ? -EINVAL : -EBUSY;
>> + }
>> +
>> + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>> return -EINVAL;
>>
>> - /* stop any callbacks from progressing */
>> - queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> + /*
>> + * There is no good/easy way to do this without race conditions, so we
>> + * are just going to throw our hands in the air and hope that the user
>> + * has read the documentation and has ensured that ports are stopped at
>> + * the time we enter the API functions.
>> + */
>> + queue_cfg = queue_list_take(lcore_cfg, &qdata);
>> + if (queue_cfg == NULL)
>> + return -ENOENT;
>>
>> - switch (queue_cfg->cb_mode) {
>> + /* if we've removed all queues from the lists, set state to disabled */
>> + if (lcore_cfg->n_queues == 0)
>> + lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> +
>> + switch (lcore_cfg->cb_mode) {
>> case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
>> case RTE_POWER_MGMT_TYPE_PAUSE:
>> - rte_eth_remove_rx_callback(port_id, queue_id,
>> - queue_cfg->cur_cb);
>> + rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>> break;
>> case RTE_POWER_MGMT_TYPE_SCALE:
>> rte_power_freq_max(lcore_id);
>> - rte_eth_remove_rx_callback(port_id, queue_id,
>> - queue_cfg->cur_cb);
>> + rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>> rte_power_exit(lcore_id);
>> break;
>> }
>> @@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>> * ports before calling any of these API's, so we can assume that the
>> * callbacks can be freed. we're intentionally casting away const-ness.
>> */
>> - rte_free((void *)queue_cfg->cur_cb);
>> + rte_free((void *)queue_cfg->cb);
>> + free(queue_cfg);
>>
>> return 0;
>> }
>> +
>> +RTE_INIT(rte_power_ethdev_pmgmt_init) {
>> + size_t i;
>> +
>> + /* initialize all tailqs */
>> + for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
>> + struct pmd_core_cfg *cfg = &lcore_cfgs[i];
>> + TAILQ_INIT(&cfg->head);
>> + }
>> +}
>> --
>> 2.25.1
>
--
Thanks,
Anatoly
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v5 6/7] power: support monitoring multiple Rx queues
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
` (4 preceding siblings ...)
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-29 15:48 ` Anatoly Burakov
2021-06-30 10:29 ` Ananyev, Konstantin
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v4:
- Fix possible out of bounds access
- Added missing index increment
doc/guides/prog_guide/power_man.rst | 9 ++--
lib/power/rte_power_pmd_mgmt.c | 81 ++++++++++++++++++++++++++++-
2 files changed, 85 insertions(+), 5 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index ec04a72108..94353ca012 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
The "monitor" mode is only supported in the following configurations and scenarios:
* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor_multi()`` function is supported by the platform, then
+ monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
``rte_power_monitor()`` is supported by the platform, then monitoring will be
limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
monitored from a different lcore).
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
- ``rte_power_monitor()`` function is not supported, then monitor mode will not
- be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+ two monitoring functions are supported, then monitor mode will not be supported.
* Not all Ethernet drivers support monitoring, even if the underlying
platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index fccfd236c2..2056996b9c 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -124,6 +124,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
return found;
}
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+ struct rte_power_monitor_cond *pmc, size_t len)
+{
+ const struct queue_list_entry *qle;
+ size_t i = 0;
+ int ret;
+
+ TAILQ_FOREACH(qle, &cfg->head, next) {
+ const union queue *q = &qle->queue;
+ struct rte_power_monitor_cond *cur;
+
+ /* attempted out of bounds access */
+ if (i >= len) {
+ RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+ return -1;
+ }
+
+ cur = &pmc[i++];
+ ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
static void
calc_tsc(void)
{
@@ -190,6 +216,45 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
return true;
}
+static uint16_t
+clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
+{
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
+
+ lcore_conf = &lcore_cfgs[lcore];
+
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
+ int ret;
+
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* gather all monitoring conditions */
+ ret = get_monitor_addresses(lcore_conf, pmc, RTE_DIM(pmc));
+ if (ret < 0)
+ return nb_rx;
+
+ rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
+ }
+
+ return nb_rx;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
@@ -341,14 +406,19 @@ static int
check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
{
struct rte_power_monitor_cond dummy;
+ bool multimonitor_supported;
/* check if rte_power_monitor is supported */
if (!global_data.intrinsics_support.power_monitor) {
RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
return -ENOTSUP;
}
+ /* check if multi-monitor is supported */
+ multimonitor_supported =
+ global_data.intrinsics_support.power_monitor_multi;
- if (cfg->n_queues > 0) {
+ /* if we're adding a new queue, do we support multiple queues? */
+ if (cfg->n_queues > 0 && !multimonitor_supported) {
RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
return -ENOTSUP;
}
@@ -364,6 +434,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
return 0;
}
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+ return global_data.intrinsics_support.power_monitor_multi ?
+ clb_multiwait : clb_umwait;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -428,7 +505,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (ret < 0)
goto end;
- clb = clb_umwait;
+ clb = get_monitor_callback();
break;
case RTE_POWER_MGMT_TYPE_SCALE:
/* check if we can add a new queue */
--
2.25.1
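The bounds-checked gather loop in `get_monitor_addresses()` above can be modelled as a standalone sketch. The struct layouts, the `gather_conditions()` name, and the packed stand-in address are illustrative only, not the DPDK API:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the DPDK types (hypothetical). */
struct cond { uintptr_t addr; };
struct queue { uint16_t portid; uint16_t qid; };

/* Fill one condition slot per monitored queue, refusing to write past
 * the caller-provided array -- the same guard the patch adds with its
 * "attempted out of bounds access" check. */
static int
gather_conditions(const struct queue *queues, size_t n_queues,
		struct cond *pmc, size_t len)
{
	size_t i;

	for (i = 0; i < n_queues; i++) {
		if (i >= len)
			return -1; /* too many queues being monitored */
		/* stand-in for rte_eth_get_monitor_addr() */
		pmc[i].addr = (uintptr_t)((queues[i].portid << 16) | queues[i].qid);
	}
	return 0;
}
```

On success the caller would pass `pmc` and the queue count to `rte_power_monitor_multi()`; on failure it simply skips the sleep, as `clb_multiwait()` does.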
* Re: [dpdk-dev] [PATCH v5 6/7] power: support monitoring multiple Rx queues
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-30 10:29 ` Ananyev, Konstantin
2021-07-05 10:08 ` Burakov, Anatoly
0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-30 10:29 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara
> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
> Rx queues while entering the energy efficient power state. The multi
> version will be used unconditionally if supported, and the UMWAIT one
> will only be used when multi-monitor is not supported by the hardware.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v4:
> - Fix possible out of bounds access
> - Added missing index increment
>
> doc/guides/prog_guide/power_man.rst | 9 ++--
> lib/power/rte_power_pmd_mgmt.c | 81 ++++++++++++++++++++++++++++-
> 2 files changed, 85 insertions(+), 5 deletions(-)
>
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index ec04a72108..94353ca012 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
> The "monitor" mode is only supported in the following configurations and scenarios:
>
> * If ``rte_cpu_get_intrinsics_support()`` function indicates that
> + ``rte_power_monitor_multi()`` function is supported by the platform, then
> + monitoring multiple Ethernet Rx queues for traffic will be supported.
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
> ``rte_power_monitor()`` is supported by the platform, then monitoring will be
> limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
> monitored from a different lcore).
>
> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
> - ``rte_power_monitor()`` function is not supported, then monitor mode will not
> - be supported.
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
> + two monitoring functions are supported, then monitor mode will not be supported.
>
> * Not all Ethernet drivers support monitoring, even if the underlying
> platform may support the necessary CPU instructions. Please refer to
> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> index fccfd236c2..2056996b9c 100644
> --- a/lib/power/rte_power_pmd_mgmt.c
> +++ b/lib/power/rte_power_pmd_mgmt.c
> @@ -124,6 +124,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
> return found;
> }
>
> +static inline int
> +get_monitor_addresses(struct pmd_core_cfg *cfg,
> + struct rte_power_monitor_cond *pmc, size_t len)
> +{
> + const struct queue_list_entry *qle;
> + size_t i = 0;
> + int ret;
> +
> + TAILQ_FOREACH(qle, &cfg->head, next) {
> + const union queue *q = &qle->queue;
> + struct rte_power_monitor_cond *cur;
> +
> + /* attempted out of bounds access */
> + if (i >= len) {
> + RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
> + return -1;
> + }
> +
> + cur = &pmc[i++];
> + ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
> + if (ret < 0)
> + return ret;
> + }
> + return 0;
> +}
> +
> static void
> calc_tsc(void)
> {
> @@ -190,6 +216,45 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
> return true;
> }
>
> +static uint16_t
> +clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
> + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> + uint16_t max_pkts __rte_unused, void *arg)
> +{
> + const unsigned int lcore = rte_lcore_id();
> + struct queue_list_entry *queue_conf = arg;
> + struct pmd_core_cfg *lcore_conf;
> + const bool empty = nb_rx == 0;
> +
> + lcore_conf = &lcore_cfgs[lcore];
> +
> + /* early exit */
> + if (likely(!empty))
> + /* early exit */
> + queue_reset(lcore_conf, queue_conf);
> + else {
> + struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
As discussed, I still think it needs to be pmc[lcore_conf->n_queues];
Or if VLA is not an option - alloca(), or dynamic lcore_conf->pmc[], or...
> + int ret;
> +
> + /* can this queue sleep? */
> + if (!queue_can_sleep(lcore_conf, queue_conf))
> + return nb_rx;
> +
> + /* can this lcore sleep? */
> + if (!lcore_can_sleep(lcore_conf))
> + return nb_rx;
> +
> + /* gather all monitoring conditions */
> + ret = get_monitor_addresses(lcore_conf, pmc, RTE_DIM(pmc));
> + if (ret < 0)
> + return nb_rx;
> +
> + rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
> + }
> +
> + return nb_rx;
> +}
> +
> static uint16_t
> clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
> @@ -341,14 +406,19 @@ static int
> check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
> {
> struct rte_power_monitor_cond dummy;
> + bool multimonitor_supported;
>
> /* check if rte_power_monitor is supported */
> if (!global_data.intrinsics_support.power_monitor) {
> RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> return -ENOTSUP;
> }
> + /* check if multi-monitor is supported */
> + multimonitor_supported =
> + global_data.intrinsics_support.power_monitor_multi;
>
> - if (cfg->n_queues > 0) {
> + /* if we're adding a new queue, do we support multiple queues? */
> + if (cfg->n_queues > 0 && !multimonitor_supported) {
> RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
> return -ENOTSUP;
> }
> @@ -364,6 +434,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
> return 0;
> }
>
> +static inline rte_rx_callback_fn
> +get_monitor_callback(void)
> +{
> + return global_data.intrinsics_support.power_monitor_multi ?
> + clb_multiwait : clb_umwait;
> +}
> +
> int
> rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
> @@ -428,7 +505,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> if (ret < 0)
> goto end;
>
> - clb = clb_umwait;
> + clb = get_monitor_callback();
> break;
> case RTE_POWER_MGMT_TYPE_SCALE:
> /* check if we can add a new queue */
> --
> 2.25.1
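Konstantin's sizing suggestion above, allocating only `n_queues` conditions instead of a fixed `RTE_MAX_ETHPORTS` array, can be sketched with a heap-backed buffer kept in the per-lcore config. All field and function names here are hypothetical, not the library's actual state:

```c
#include <stdlib.h>

struct cond { volatile void *addr; };

/* Hypothetical per-lcore state holding a lazily grown condition array. */
struct core_cfg {
	unsigned int n_queues;
	struct cond *pmc;      /* sized to at least n_queues */
	unsigned int pmc_len;
};

/* Ensure cfg->pmc can hold one condition per monitored queue; grow it
 * only when the queue count increases, so the polling hot path never
 * allocates. */
static int
ensure_pmc(struct core_cfg *cfg)
{
	if (cfg->pmc_len < cfg->n_queues) {
		struct cond *p = realloc(cfg->pmc, cfg->n_queues * sizeof(*p));
		if (p == NULL)
			return -1;
		cfg->pmc = p;
		cfg->pmc_len = cfg->n_queues;
	}
	return 0;
}
```

A VLA or `alloca()` inside the callback would avoid the persistent buffer entirely, at the cost of a per-poll stack allocation; the dynamic buffer trades a little bookkeeping for a bounded, one-time cost.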
* Re: [dpdk-dev] [PATCH v5 6/7] power: support monitoring multiple Rx queues
2021-06-30 10:29 ` Ananyev, Konstantin
@ 2021-07-05 10:08 ` Burakov, Anatoly
0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-05 10:08 UTC (permalink / raw)
To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara
On 30-Jun-21 11:29 AM, Ananyev, Konstantin wrote:
>
>
>> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
>> Rx queues while entering the energy efficient power state. The multi
>> version will be used unconditionally if supported, and the UMWAIT one
>> will only be used when multi-monitor is not supported by the hardware.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>> v4:
>> - Fix possible out of bounds access
>> - Added missing index increment
>>
>> doc/guides/prog_guide/power_man.rst | 9 ++--
>> lib/power/rte_power_pmd_mgmt.c | 81 ++++++++++++++++++++++++++++-
>> 2 files changed, 85 insertions(+), 5 deletions(-)
>>
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index ec04a72108..94353ca012 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
>> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
>> The "monitor" mode is only supported in the following configurations and scenarios:
>>
>> * If ``rte_cpu_get_intrinsics_support()`` function indicates that
>> + ``rte_power_monitor_multi()`` function is supported by the platform, then
>> + monitoring multiple Ethernet Rx queues for traffic will be supported.
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
>> ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>> limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>> monitored from a different lcore).
>>
>> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>> - ``rte_power_monitor()`` function is not supported, then monitor mode will not
>> - be supported.
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
>> + two monitoring functions are supported, then monitor mode will not be supported.
>>
>> * Not all Ethernet drivers support monitoring, even if the underlying
>> platform may support the necessary CPU instructions. Please refer to
>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>> index fccfd236c2..2056996b9c 100644
>> --- a/lib/power/rte_power_pmd_mgmt.c
>> +++ b/lib/power/rte_power_pmd_mgmt.c
>> @@ -124,6 +124,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
>> return found;
>> }
>>
>> +static inline int
>> +get_monitor_addresses(struct pmd_core_cfg *cfg,
>> + struct rte_power_monitor_cond *pmc, size_t len)
>> +{
>> + const struct queue_list_entry *qle;
>> + size_t i = 0;
>> + int ret;
>> +
>> + TAILQ_FOREACH(qle, &cfg->head, next) {
>> + const union queue *q = &qle->queue;
>> + struct rte_power_monitor_cond *cur;
>> +
>> + /* attempted out of bounds access */
>> + if (i >= len) {
>> + RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
>> + return -1;
>> + }
>> +
>> + cur = &pmc[i++];
>> + ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
>> + if (ret < 0)
>> + return ret;
>> + }
>> + return 0;
>> +}
>> +
>> static void
>> calc_tsc(void)
>> {
>> @@ -190,6 +216,45 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
>> return true;
>> }
>>
>> +static uint16_t
>> +clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>> + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> + uint16_t max_pkts __rte_unused, void *arg)
>> +{
>> + const unsigned int lcore = rte_lcore_id();
>> + struct queue_list_entry *queue_conf = arg;
>> + struct pmd_core_cfg *lcore_conf;
>> + const bool empty = nb_rx == 0;
>> +
>> + lcore_conf = &lcore_cfgs[lcore];
>> +
>> + /* early exit */
>> + if (likely(!empty))
>> + /* early exit */
>> + queue_reset(lcore_conf, queue_conf);
>> + else {
>> + struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
>
> As discussed, I still think it needs to be pmc[lcore_conf->n_queues];
> Or if VLA is not an option - alloca(), or dynamic lcore_conf->pmc[], or...
>
Apologies, this was a rebase mistake. Thanks for catching it! Will fix
in v6.
--
Thanks,
Anatoly
* [dpdk-dev] [PATCH v5 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
` (5 preceding siblings ...)
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-29 15:48 ` Anatoly Burakov
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
examples/l3fwd-power/main.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..52f56dc405 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2723,12 +2723,6 @@ main(int argc, char **argv)
printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
fflush(stdout);
- /* PMD power management mode can only do 1 queue per core */
- if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
- rte_exit(EXIT_FAILURE,
- "In PMD power management mode, only one queue per lcore is allowed\n");
- }
-
/* init RX queues */
for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
struct rte_eth_rxconf rxq_conf;
--
2.25.1
* [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
` (6 preceding siblings ...)
2021-06-29 15:48 ` [dpdk-dev] [PATCH v5 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-07-05 15:21 ` Anatoly Burakov
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
` (7 more replies)
7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
To: dev; +Cc: david.hunt, ciara.loftus, konstantin.ananyev
This patchset introduces several changes related to PMD power management:
- Changed monitoring intrinsics to use callbacks as a comparison function, based
on previous patchset [1] but incorporating feedback [2] - this hopefully will
make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
accompanying infrastructure and example apps changes
v6:
- Improved the algorithm for multi-queue sleep
- Fixed segfault and addressed other feedback
v5:
- Removed "power save queue" API and replaced with mechanism suggested by
Konstantin
- Addressed other feedback
v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections
v3:
- Moved some doc updates to NIC features list
v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary
[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274
Anatoly Burakov (7):
power_intrinsics: use callbacks for comparison
net/af_xdp: add power monitor support
eal: add power monitor for multiple events
power: remove thread safety from PMD power API's
power: support callbacks for multiple Rx queues
power: support monitoring multiple Rx queues
l3fwd-power: support multiqueue in PMD pmgmt modes
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 68 +-
doc/guides/rel_notes/release_21_08.rst | 11 +
drivers/event/dlb2/dlb2.c | 17 +-
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +
drivers/net/i40e/i40e_rxtx.c | 20 +-
drivers/net/iavf/iavf_rxtx.c | 20 +-
drivers/net/ice/ice_rxtx.c | 20 +-
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +-
drivers/net/mlx5/mlx5_rx.c | 17 +-
examples/l3fwd-power/main.c | 6 -
lib/eal/arm/rte_power_intrinsics.c | 11 +
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 68 +-
lib/eal/ppc/rte_power_intrinsics.c | 11 +
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 90 ++-
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 655 +++++++++++++-----
lib/power/rte_power_pmd_mgmt.h | 6 +
21 files changed, 832 insertions(+), 262 deletions(-)
--
2.25.1
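The callback-based comparison described in the first item of the cover letter can be modelled in isolation. The struct and names below are a simplified sketch, not the actual `rte_power_monitor_cond` definition:

```c
#include <stdint.h>

#define MON_OPAQUE_SZ 4 /* stand-in for RTE_POWER_MONITOR_OPAQUE_SZ */

/* The callback returns a negative value to abort entering the sleep. */
typedef int (*monitor_clb_t)(const uint64_t val,
		const uint64_t opaque[MON_OPAQUE_SZ]);

struct monitor_cond {
	volatile uint64_t *addr;        /* address to monitor */
	uint64_t opaque[MON_OPAQUE_SZ]; /* callback-private data */
	monitor_clb_t fn;
};

/* A dlb2-style callback: abort when the masked value matches. */
#define CLB_MASK_IDX 0
#define CLB_VAL_IDX  1
static int
match_callback(const uint64_t val, const uint64_t opaque[MON_OPAQUE_SZ])
{
	return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
}

/* What the intrinsic does before arming the wait: sample the current
 * value and let the callback decide whether sleeping is pointless. */
static int
should_abort(const struct monitor_cond *c)
{
	return c->fn(*c->addr, c->opaque) < 0;
}
```

Because the PMD supplies both the opaque data and the function, any comparison semantics fit behind this one interface, which is what lets virtio-style drivers define their own wake-up condition without changing the intrinsic.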
* [dpdk-dev] [PATCH v6 1/7] power_intrinsics: use callbacks for comparison
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-07-05 15:21 ` Anatoly Burakov
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 2/7] net/af_xdp: add power monitor support Anatoly Burakov
` (6 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Previously, the semantics of power monitor were such that we were
checking current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.
This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.
Existing implementations are adjusted to follow the new semantics.
Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
Notes:
v4:
- Return error if callback is set to NULL
- Replace raw number with a macro in monitor condition opaque data
v2:
- Use callback mechanism for more flexibility
- Address feedback from Konstantin
doc/guides/rel_notes/release_21_08.rst | 1 +
drivers/event/dlb2/dlb2.c | 17 ++++++++--
drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
drivers/net/ice/ice_rxtx.c | 20 +++++++----
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
.../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
9 files changed, 121 insertions(+), 44 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
Also, make sure to start the actual text at the margin.
=======================================================
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
ABI Changes
-----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
}
}
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ /* abort if the value matches */
+ return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
static inline int
dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
expected_value = 0;
pmc.addr = monitor_addr;
- pmc.val = expected_value;
- pmc.mask = qe_mask.raw_qe[1];
+ /* store expected value and comparison mask in opaque data */
+ pmc.opaque[CLB_VAL_IDX] = expected_value;
+ pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+ /* set up callback */
+ pmc.fn = dlb2_monitor_callback;
pmc.size = sizeof(uint64_t);
rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..081682f88b 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
#define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
+static int
+i40e_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = i40e_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index 0361af0d85..7ed196ec22 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
}
+static int
+iavf_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = iavf_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a3e7..d12437d19d 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
+static int
+ice_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.status_error0;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
- pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /* comparison callback */
+ pmc->fn = ice_monitor_callback;
/* register is 16-bit */
pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
};
+static int
+ixgbe_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.upper.status_error;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
- pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /* comparison callback */
+ pmc->fn = ixgbe_monitor_callback;
/* the registers are 32-bit */
pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..17370b77dc 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return rx_queue_count(rxq);
}
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t m = opaque[CLB_MSK_IDX];
+ const uint64_t v = opaque[CLB_VAL_IDX];
+
+ return (value & m) == v ? -1 : 0;
+}
+
int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -rte_errno;
}
pmc->addr = &cqe->op_own;
- pmc->val = !!idx;
- pmc->mask = MLX5_CQE_OWNER_MASK;
+ pmc->opaque[CLB_VAL_IDX] = !!idx;
+ pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+ pmc->fn = mlx_monitor_callback;
pmc->size = sizeof(uint8_t);
return 0;
}
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
* which are architecture-dependent.
*/
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ * The value read from memory.
+ * @param opaque
+ * Callback-specific data.
+ *
+ * @return
+ * 0 if entering of power optimized state should proceed
+ * -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
struct rte_power_monitor_cond {
volatile void *addr; /**< Address to monitor for changes */
- uint64_t val; /**< If the `mask` is non-zero, location pointed
- * to by `addr` will be read and compared
- * against this value.
- */
- uint64_t mask; /**< 64-bit mask to extract value read from `addr` */
- uint8_t size; /**< Data size (in bytes) that will be used to compare
- * expected value (`val`) with data read from the
+ uint8_t size; /**< Data size (in bytes) that will be read from the
* monitored memory location (`addr`). Can be 1, 2,
* 4, or 8. Supplying any other value will result in
* an error.
*/
+ rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+ * entering power optimized state should
+ * be aborted.
+ */
+ uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+ /**< Callback-specific data */
};
/**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
const unsigned int lcore_id = rte_lcore_id();
struct power_wait_status *s;
+ uint64_t cur_value;
/* prevent user from running this instruction if it's not supported */
if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
if (__check_val_size(pmc->size) < 0)
return -EINVAL;
+ if (pmc->fn == NULL)
+ return -EINVAL;
+
s = &wait_status[lcore_id];
/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
/* now that we've put this address into monitor, we can unlock */
rte_spinlock_unlock(&s->lock);
- /* if we have a comparison mask, we might not need to sleep at all */
- if (pmc->mask) {
- const uint64_t cur_value = __get_umwait_val(
- pmc->addr, pmc->size);
- const uint64_t masked = cur_value & pmc->mask;
+ cur_value = __get_umwait_val(pmc->addr, pmc->size);
- /* if the masked value is already matching, abort */
- if (masked == pmc->val)
- goto end;
- }
+ /* check if callback indicates we should abort */
+ if (pmc->fn(cur_value, pmc->opaque) != 0)
+ goto end;
/* execute UMWAIT */
asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
--
2.25.1
* [dpdk-dev] [PATCH v6 2/7] net/af_xdp: add power monitor support
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-07-05 15:21 ` Anatoly Burakov
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 3/7] eal: add power monitor for multiple events Anatoly Burakov
` (5 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev
Implement support for .get_monitor_addr in AF_XDP driver.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Rewrite using the callback mechanism
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..7830d0c23a 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
#include <rte_malloc.h>
#include <rte_ring.h>
#include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
#include "compat.h"
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t v = opaque[CLB_VAL_IDX];
+ const uint64_t m = (uint32_t)~0;
+
+ /* if the value has changed, abort entering power optimized state */
+ return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+ struct pkt_rx_queue *rxq = rx_queue;
+ unsigned int *prod = rxq->rx.producer;
+ const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+ /* watch for changes in producer ring */
+ pmc->addr = (void*)prod;
+
+ /* store current value */
+ pmc->opaque[CLB_VAL_IDX] = cur_val;
+ pmc->fn = eth_monitor_callback;
+
+ /* AF_XDP producer ring index is 32-bit */
+ pmc->size = sizeof(uint32_t);
+
+ return 0;
+}
+
static int
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
{
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.stats_get = eth_stats_get,
.stats_reset = eth_stats_reset,
+ .get_monitor_addr = eth_get_monitor_addr
};
/** parse busy_budget argument */
--
2.25.1
* [dpdk-dev] [PATCH v6 3/7] eal: add power monitor for multiple events
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-07-05 15:21 ` Anatoly Burakov
2021-08-04 9:52 ` Kinsella, Ray
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
` (4 subsequent siblings)
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
To: dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v4:
- Fixed bugs in accessing the monitor condition
- Abort on any monitor condition not having a defined callback
v2:
- Adapt to callback mechanism
doc/guides/rel_notes/release_21_08.rst | 2 +
lib/eal/arm/rte_power_intrinsics.c | 11 +++
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 35 +++++++++
lib/eal/ppc/rte_power_intrinsics.c | 11 +++
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++
8 files changed, 139 insertions(+)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+
Removed Items
-------------
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
/**< indicates support for rte_power_monitor function */
uint32_t power_pause : 1;
/**< indicates support for rte_power_pause function */
+ uint32_t power_monitor_multi : 1;
+ /**< indicates support for rte_power_monitor_multi function */
};
/**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
__rte_experimental
int rte_power_pause(const uint64_t tsc_timestamp);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each monitoring condition carries a callback (`fn`) and
+ * callback-specific opaque data. The current value read from the monitored
+ * address is passed to the callback, and if the callback indicates an abort,
+ * the entering of the optimized power state may be aborted.
+ *
+ * @warning It is responsibility of the user to check if this function is
+ * supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ * Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ * An array of monitoring condition structures.
+ * @param num
+ * Length of the `pmc` array.
+ * @param tsc_timestamp
+ * Maximum TSC timestamp to wait for. Note that the wait behavior is
+ * architecture-dependent.
+ *
+ * @return
+ * 0 on success
+ * -EINVAL on invalid parameters
+ * -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp);
+
#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
rte_version_release; # WINDOWS_NO_EXPORT
rte_version_suffix; # WINDOWS_NO_EXPORT
rte_version_year; # WINDOWS_NO_EXPORT
+
+ # added in 21.08
+ rte_power_monitor_multi; # WINDOWS_NO_EXPORT
};
INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
intrinsics->power_monitor = 1;
intrinsics->power_pause = 1;
+ if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+ intrinsics->power_monitor_multi = 1;
}
}
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
#include <rte_common.h>
#include <rte_lcore.h>
+#include <rte_rtm.h>
#include <rte_spinlock.h>
#include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
}
static bool wait_supported;
+static bool wait_multi_supported;
static inline uint64_t
__get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
if (i.power_monitor && i.power_pause)
wait_supported = 1;
+ if (i.power_monitor_multi)
+ wait_multi_supported = 1;
}
int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
* In this case, since we've already woken up, the "wakeup" was
* unneeded, and since T1 is still waiting on T2 releasing the lock, the
* wakeup address is still valid so it's perfectly safe to write it.
+ *
+ * For multi-monitor case, the act of locking will in itself trigger the
+ * wakeup, so no additional writes are necessary.
*/
rte_spinlock_lock(&s->lock);
if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return 0;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ const unsigned int lcore_id = rte_lcore_id();
+ struct power_wait_status *s = &wait_status[lcore_id];
+ uint32_t i, rc;
+
+ /* check if supported */
+ if (!wait_multi_supported)
+ return -ENOTSUP;
+
+ if (pmc == NULL || num == 0)
+ return -EINVAL;
+
+ /* we are already inside transaction region, return */
+ if (rte_xtest() != 0)
+ return 0;
+
+ /* start new transaction region */
+ rc = rte_xbegin();
+
+ /* transaction abort, possible write to one of wait addresses */
+ if (rc != RTE_XBEGIN_STARTED)
+ return 0;
+
+ /*
+ * the mere act of reading the lock status here adds the lock to
+ * the read set. This means that when we trigger a wakeup from another
+ * thread, even if we don't have a defined wakeup address and thus don't
+ * actually cause any writes, the act of locking our lock will itself
+ * trigger the wakeup and abort the transaction.
+ */
+ rte_spinlock_is_locked(&s->lock);
+
+ /*
+ * add all addresses to wait on into transaction read-set and check if
+ * any of wakeup conditions are already met.
+ */
+ rc = 0;
+ for (i = 0; i < num; i++) {
+ const struct rte_power_monitor_cond *c = &pmc[i];
+
+ /* cannot be NULL */
+ if (c->fn == NULL) {
+ rc = -EINVAL;
+ break;
+ }
+
+ const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+ /* abort if callback indicates that we need to stop */
+ if (c->fn(val, c->opaque) != 0)
+ break;
+ }
+
+ /* none of the conditions were met, sleep until timeout */
+ if (i == num)
+ rte_power_pause(tsc_timestamp);
+
+ /* end transaction region */
+ rte_xend();
+
+ return rc;
+}
--
2.25.1
* Re: [dpdk-dev] [PATCH v6 3/7] eal: add power monitor for multiple events
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-08-04 9:52 ` Kinsella, Ray
0 siblings, 0 replies; 165+ messages in thread
From: Kinsella, Ray @ 2021-08-04 9:52 UTC (permalink / raw)
To: Anatoly Burakov, dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
David Christensen, Neil Horman, Bruce Richardson,
Konstantin Ananyev
Cc: david.hunt, ciara.loftus
On 05/07/2021 16:21, Anatoly Burakov wrote:
> Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
> what UMWAIT does, but without the limitation of having to listen for
> just one event. This works because the optimized power state used by the
> TPAUSE instruction will cause a wake up on RTM transaction abort, so if
> we add the addresses we're interested in to the read-set, any write to
> those addresses will wake us up.
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v4:
> - Fixed bugs in accessing the monitor condition
> - Abort on any monitor condition not having a defined callback
>
> v2:
> - Adapt to callback mechanism
>
> doc/guides/rel_notes/release_21_08.rst | 2 +
> lib/eal/arm/rte_power_intrinsics.c | 11 +++
> lib/eal/include/generic/rte_cpuflags.h | 2 +
> .../include/generic/rte_power_intrinsics.h | 35 +++++++++
> lib/eal/ppc/rte_power_intrinsics.c | 11 +++
> lib/eal/version.map | 3 +
> lib/eal/x86/rte_cpuflags.c | 2 +
> lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++
> 8 files changed, 139 insertions(+)
>
Acked-by: Ray Kinsella <mdr@ashroe.eu>
* [dpdk-dev] [PATCH v6 4/7] power: remove thread safety from PMD power API's
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
` (2 preceding siblings ...)
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-07-05 15:21 ` Anatoly Burakov
2021-07-07 10:14 ` Ananyev, Konstantin
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
` (3 subsequent siblings)
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.
We could have used something like an RCU for this use case, but absent
a pressing need for thread safety we'll go the easy way and just
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Add check for stopped queue
- Clarified doc message
- Added release notes
doc/guides/rel_notes/release_21_08.rst | 5 +
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 133 ++++++++++---------------
lib/power/rte_power_pmd_mgmt.h | 6 ++
4 files changed, 67 insertions(+), 80 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+* rte_power: The experimental PMD power management API is no longer considered
+ to be thread safe; all Rx queues affected by the API will now need to be
+ stopped before making any changes to the power management scheme.
+
+
ABI Changes
-----------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
'rte_power_pmd_mgmt.h',
'rte_power_guest_channel.h',
)
+if cc.has_argument('-Wno-cast-qual')
+ cflags += '-Wno-cast-qual'
+endif
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
/**< Callback mode for this queue */
const struct rte_eth_rxtx_callback *cur_cb;
/**< Callback instance */
- volatile bool umwait_in_progress;
- /**< are we currently sleeping? */
uint64_t empty_poll_stats;
/**< Number of empty polls */
} __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
struct rte_power_monitor_cond pmc;
uint16_t ret;
- /*
- * we might get a cancellation request while being
- * inside the callback, in which case the wakeup
- * wouldn't work because it would've arrived too early.
- *
- * to get around this, we notify the other thread that
- * we're sleeping, so that it can spin until we're done.
- * unsolicited wakeups are perfectly safe.
- */
- q_conf->umwait_in_progress = true;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- /* check if we need to cancel sleep */
- if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
- /* use monitoring condition to sleep */
- ret = rte_eth_get_monitor_addr(port_id, qidx,
- &pmc);
- if (ret == 0)
- rte_power_monitor(&pmc, UINT64_MAX);
- }
- q_conf->umwait_in_progress = false;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+ /* use monitoring condition to sleep */
+ ret = rte_eth_get_monitor_addr(port_id, qidx,
+ &pmc);
+ if (ret == 0)
+ rte_power_monitor(&pmc, UINT64_MAX);
}
} else
q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
return nb_rx;
}
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+ struct rte_eth_rxq_info qinfo;
+
+ if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+ return -1;
+
+ return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
struct pmd_queue_cfg *queue_cfg;
struct rte_eth_dev_info info;
+ rte_rx_callback_fn clb;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
queue_cfg = &port_cfg[port_id][queue_id];
if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->umwait_in_progress = false;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* ensure we update our state before callback starts */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_umwait, NULL);
+ clb = clb_umwait;
break;
}
case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
- queue_id, clb_scale_freq, NULL);
+ clb = clb_scale_freq;
break;
}
case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (global_data.tsc_per_us == 0)
calc_tsc();
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_pause, NULL);
+ clb = clb_pause;
break;
+ default:
+ RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+ ret = -EINVAL;
+ goto end;
}
+
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, NULL);
+
ret = 0;
end:
return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
struct pmd_queue_cfg *queue_cfg;
+ int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
return -EINVAL;
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
/* no need to check queue id as wrong queue id would not be enabled */
queue_cfg = &port_cfg[port_id][queue_id];
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
/* stop any callbacks from progressing */
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
- /* ensure we update our state before continuing */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
switch (queue_cfg->cb_mode) {
- case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- bool exit = false;
- do {
- /*
- * we may request cancellation while the other thread
- * has just entered the callback but hasn't started
- * sleeping yet, so keep waking it up until we know it's
- * done sleeping.
- */
- if (queue_cfg->umwait_in_progress)
- rte_power_monitor_wakeup(lcore_id);
- else
- exit = true;
- } while (!exit);
- }
- /* fall-through */
+ case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
rte_eth_remove_rx_callback(port_id, queue_id,
queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
break;
}
/*
- * we don't free the RX callback here because it is unsafe to do so
- * unless we know for a fact that all data plane threads have stopped.
+ * the API doc mandates that the user stops all processing on affected
+ * ports before calling any of these API's, so we can assume that the
+ * callbacks can be freed. we're intentionally casting away const-ness.
*/
- queue_cfg->cur_cb = NULL;
+ rte_free((void *)queue_cfg->cur_cb);
return 0;
}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue will be polled from.
* @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue is polled from.
* @param port_id
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 4/7] power: remove thread safety from PMD power API's
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-07-07 10:14 ` Ananyev, Konstantin
0 siblings, 0 replies; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-07-07 10:14 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara
> Currently, we expect that only one callback can be active at any given
> moment, for a particular queue configuration, which is relatively easy
> to implement in a thread-safe way. However, we're about to add support
> for multiple queues per lcore, which will greatly increase the
> possibility of various race conditions.
>
> We could have used something like an RCU for this use case, but absent
> of a pressing need for thread safety we'll go the easy way and just
> mandate that the API's are to be called when all affected ports are
> stopped, and document this limitation. This greatly simplifies the
> `rte_power_monitor`-related code.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 2.25.1
* [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
` (3 preceding siblings ...)
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-07-05 15:22 ` Anatoly Burakov
2021-07-06 18:50 ` Ananyev, Konstantin
2021-07-07 10:04 ` David Hunt
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 6/7] power: support monitoring " Anatoly Burakov
` (2 subsequent siblings)
7 siblings, 2 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:22 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.
The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:
- Replace per-queue structures with per-lcore ones, so that any device
polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
added to the list of queues to poll, so that the callback is aware of
other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
shared between all queues polled on a particular lcore, and are only
activated when all queues in the list were polled and were determined
to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
is incapable of monitoring more than one address.
Also, while we're at it, update and improve the docs.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v6:
- Track each individual queue sleep status (Konstantin)
- Fix segfault (Dave)
v5:
- Remove the "power save queue" API and replace it with mechanism suggested by
Konstantin
v3:
- Move the list of supported NICs to NIC feature table
v2:
- Use a TAILQ for queues instead of a static array
- Address feedback from Konstantin
- Add additional checks for stopped queues
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 65 ++--
doc/guides/rel_notes/release_21_08.rst | 3 +
lib/power/rte_power_pmd_mgmt.c | 452 +++++++++++++++++++------
4 files changed, 394 insertions(+), 136 deletions(-)
diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
* **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
* **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
.. _nic_features_other:
Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..ec04a72108 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,41 @@ Ethernet PMD Power Management API
Abstract
~~~~~~~~
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
- This power saving scheme will put the CPU into optimized power state
- and use the ``rte_power_monitor()`` function
- to monitor the Ethernet PMD RX descriptor address,
- and wake the CPU up whenever there's new traffic.
-
-Pause
- This power saving scheme will avoid busy polling
- by either entering power-optimized sleep state
- with ``rte_power_pause()`` function,
- or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
- This power saving scheme will use ``librte_power`` library
- functionality to scale the core frequency up/down
- depending on traffic volume.
-
-.. note::
-
- Currently, this power management API is limited to mandatory mapping
- of 1 queue to 1 core (multiple queues are supported,
- but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+ This power saving scheme will put the CPU into optimized power state and
+ monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+ there's new traffic. Support for this scheme may not be available on all
+ platforms, and further limitations may apply (see below).
+
+* Pause
+ This power saving scheme will avoid busy polling by either entering
+ power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+ not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+ This power saving scheme will use ``librte_power`` library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+ limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
+ monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+ ``rte_power_monitor()`` function is not supported, then monitor mode will not
+ be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+ platform may support the necessary CPU instructions. Please refer to
+ :doc:`../nics/overview` for more information.
+
API Overview for Ethernet PMD Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +249,5 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index f015c509fc..3926d45ef8 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -57,6 +57,9 @@ New Features
* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+* rte_power: The experimental PMD power management API now supports managing
+ multiple Ethernet Rx queues per lcore.
+
Removed Items
-------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..9ffeda05ed 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,98 @@ enum pmd_mgmt_state {
PMD_MGMT_ENABLED
};
-struct pmd_queue_cfg {
+union queue {
+ uint32_t val;
+ struct {
+ uint16_t portid;
+ uint16_t qid;
+ };
+};
+
+struct queue_list_entry {
+ TAILQ_ENTRY(queue_list_entry) next;
+ union queue queue;
+ uint64_t n_empty_polls;
+ uint64_t n_sleeps;
+ const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+ TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+ /**< List of queues associated with this lcore */
+ size_t n_queues;
+ /**< How many queues are in the list? */
volatile enum pmd_mgmt_state pwr_mgmt_state;
/**< State of power management for this queue */
enum rte_power_pmd_mgmt_type cb_mode;
/**< Callback mode for this queue */
- const struct rte_eth_rxtx_callback *cur_cb;
- /**< Callback instance */
- uint64_t empty_poll_stats;
- /**< Number of empty polls */
+ uint64_t n_queues_ready_to_sleep;
+ /**< Number of queues ready to enter power optimized state */
+ uint64_t sleep_target;
+ /**< Prevent a queue from triggering sleep multiple times */
} __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+ return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+ dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *cur;
+
+ TAILQ_FOREACH(cur, &cfg->head, next) {
+ if (queue_equal(&cur->queue, q))
+ return cur;
+ }
+ return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *qle;
+
+ /* is it already in the list? */
+ if (queue_list_find(cfg, q) != NULL)
+ return -EEXIST;
+
+ qle = malloc(sizeof(*qle));
+ if (qle == NULL)
+ return -ENOMEM;
+ memset(qle, 0, sizeof(*qle));
+
+ queue_copy(&qle->queue, q);
+ TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+ cfg->n_queues++;
+
+ return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *found;
+
+ found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return NULL;
+
+ TAILQ_REMOVE(&cfg->head, found, next);
+ cfg->n_queues--;
+
+ /* freeing is responsibility of the caller */
+ return found;
+}
static void
calc_tsc(void)
@@ -74,21 +154,75 @@ calc_tsc(void)
}
}
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ const bool is_ready_to_sleep = qcfg->n_empty_polls > EMPTYPOLL_MAX;
+
+ /* reset empty poll counter for this queue */
+ qcfg->n_empty_polls = 0;
+ /* reset the queue sleep counter as well */
+ qcfg->n_sleeps = 0;
+ /* remove the queue from the list of queues ready to sleep */
+ if (is_ready_to_sleep)
+ cfg->n_queues_ready_to_sleep--;
+ /*
+ * no need to change the lcore sleep target counter because this lcore will
+ * reach n_sleeps anyway, and the other queues are already counted so
+ * there's no need to do anything else.
+ */
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ /* this function is called - that means we have an empty poll */
+ qcfg->n_empty_polls++;
+
+ /* if we haven't reached threshold for empty polls, we can't sleep */
+ if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+ return false;
+
+ /*
+ * we've reached a point where we are able to sleep, but we still need
+ * to check if this queue has already been marked for sleeping.
+ */
+ if (qcfg->n_sleeps == cfg->sleep_target)
+ return true;
+
+ /* mark this queue as ready for sleep */
+ qcfg->n_sleeps = cfg->sleep_target;
+ cfg->n_queues_ready_to_sleep++;
+
+ return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+ /* are all queues ready to sleep? */
+ if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+ return false;
+
+ /* we've reached an iteration where we can sleep, reset sleep counter */
+ cfg->n_queues_ready_to_sleep = 0;
+ cfg->sleep_target++;
+
+ return true;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+ uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
{
+ struct queue_list_entry *queue_conf = arg;
- struct pmd_queue_cfg *q_conf;
-
- q_conf = &port_cfg[port_id][qidx];
-
+ /* this callback can't do more than one queue, omit multiqueue logic */
if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+ queue_conf->n_empty_polls++;
+ if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
struct rte_power_monitor_cond pmc;
- uint16_t ret;
+ int ret;
/* use monitoring condition to sleep */
ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +231,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
rte_power_monitor(&pmc, UINT64_MAX);
}
} else
- q_conf->empty_poll_stats = 0;
+ queue_conf->n_empty_polls = 0;
return nb_rx;
}
static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
- q_conf = &port_cfg[port_id][qidx];
+ lcore_conf = &lcore_cfgs[lcore];
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- /* sleep for 1 microsecond */
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
- /* use tpause if we have it */
- if (global_data.intrinsics_support.power_pause) {
- const uint64_t cur = rte_rdtsc();
- const uint64_t wait_tsc =
- cur + global_data.tsc_per_us;
- rte_power_pause(wait_tsc);
- } else {
- uint64_t i;
- for (i = 0; i < global_data.pause_per_us; i++)
- rte_pause();
- }
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* sleep for 1 microsecond, use tpause if we have it */
+ if (global_data.intrinsics_support.power_pause) {
+ const uint64_t cur = rte_rdtsc();
+ const uint64_t wait_tsc =
+ cur + global_data.tsc_per_us;
+ rte_power_pause(wait_tsc);
+ } else {
+ uint64_t i;
+ for (i = 0; i < global_data.pause_per_us; i++)
+ rte_pause();
}
- } else
- q_conf->empty_poll_stats = 0;
+ }
return nb_rx;
}
static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
- uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+ struct queue_list_entry *queue_conf = arg;
- q_conf = &port_cfg[port_id][qidx];
+ if (likely(!empty)) {
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
- /* scale down freq */
- rte_power_freq_min(rte_lcore_id());
- } else {
- q_conf->empty_poll_stats = 0;
- /* scale up freq */
+ /* scale up freq immediately */
rte_power_freq_max(rte_lcore_id());
+ } else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ rte_power_freq_min(rte_lcore_id());
}
return nb_rx;
@@ -167,11 +318,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
}
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+ const struct queue_list_entry *entry;
+
+ TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+ const union queue *q = &entry->queue;
+ int ret = queue_stopped(q->portid, q->qid);
+ if (ret != 1)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+ enum power_management_env env;
+
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ return -ENOTSUP;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore))
+ return -EINVAL;
+
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+ struct rte_power_monitor_cond dummy;
+
+ /* check if rte_power_monitor is supported */
+ if (!global_data.intrinsics_support.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ return -ENOTSUP;
+ }
+
+ if (cfg->n_queues > 0) {
+ RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+ return -ENOTSUP;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+ &dummy) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
struct rte_eth_dev_info info;
rte_rx_callback_fn clb;
int ret;
@@ -202,9 +422,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
+ /* if callback was already enabled, check current callback type */
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+ lcore_cfg->cb_mode != mode) {
ret = -EINVAL;
goto end;
}
@@ -214,53 +444,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
switch (mode) {
case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- struct rte_power_monitor_cond dummy;
-
- /* check if rte_power_monitor is supported */
- if (!global_data.intrinsics_support.power_monitor) {
- RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_monitor(lcore_cfg, &qdata);
+ if (ret < 0)
goto end;
- }
- /* check if the device supports the necessary PMD API */
- if (rte_eth_get_monitor_addr(port_id, queue_id,
- &dummy) == -ENOTSUP) {
- RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_umwait;
break;
- }
case RTE_POWER_MGMT_TYPE_SCALE:
- {
- enum power_management_env env;
- /* only PSTATE and ACPI modes are supported */
- if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
- !rte_power_check_env_supported(
- PM_ENV_PSTATE_CPUFREQ)) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_scale(lcore_id);
+ if (ret < 0)
goto end;
- }
- /* ensure we could initialize the power library */
- if (rte_power_init(lcore_id)) {
- ret = -EINVAL;
- goto end;
- }
- /* ensure we initialized the correct env */
- env = rte_power_get_env();
- if (env != PM_ENV_ACPI_CPUFREQ &&
- env != PM_ENV_PSTATE_CPUFREQ) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_scale_freq;
break;
- }
case RTE_POWER_MGMT_TYPE_PAUSE:
/* figure out various time-to-tsc conversions */
if (global_data.tsc_per_us == 0)
@@ -273,13 +470,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -EINVAL;
goto end;
}
+ /* add this queue to the list */
+ ret = queue_list_add(lcore_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+ strerror(-ret));
+ goto end;
+ }
+ /* new queue is always added last */
+ queue_cfg = TAILQ_LAST(&lcore_cfg->head, queue_list_head);
/* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb, NULL);
+ if (lcore_cfg->n_queues == 1) {
+ lcore_cfg->cb_mode = mode;
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ }
+ queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, queue_cfg);
ret = 0;
end:
@@ -290,7 +497,9 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +515,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
}
/* no need to check queue id as wrong queue id would not be enabled */
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
return -EINVAL;
- /* stop any callbacks from progressing */
- queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ /*
+ * There is no good/easy way to do this without race conditions, so we
+ * are just going to throw our hands in the air and hope that the user
+ * has read the documentation and has ensured that ports are stopped at
+ * the time we enter the API functions.
+ */
+ queue_cfg = queue_list_take(lcore_cfg, &qdata);
+ if (queue_cfg == NULL)
+ return -ENOENT;
- switch (queue_cfg->cb_mode) {
+ /* if we've removed all queues from the lists, set state to disabled */
+ if (lcore_cfg->n_queues == 0)
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+ switch (lcore_cfg->cb_mode) {
case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
break;
case RTE_POWER_MGMT_TYPE_SCALE:
rte_power_freq_max(lcore_id);
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
rte_power_exit(lcore_id);
break;
}
@@ -332,7 +557,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
* ports before calling any of these API's, so we can assume that the
* callbacks can be freed. we're intentionally casting away const-ness.
*/
- rte_free((void *)queue_cfg->cur_cb);
+ rte_free((void *)queue_cfg->cb);
+ free(queue_cfg);
return 0;
}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+ size_t i;
+
+ /* initialize all tailqs */
+ for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+ struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+ TAILQ_INIT(&cfg->head);
+ }
+}
--
2.25.1
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-06 18:50 ` Ananyev, Konstantin
2021-07-07 10:06 ` Burakov, Anatoly
2021-07-07 10:04 ` David Hunt
1 sibling, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-07-06 18:50 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
>
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
>
> - Replace per-queue structures with per-lcore ones, so that any device
> polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
> added to the list of queues to poll, so that the callback is aware of
> other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism are
> shared between all queues polled on a particular lcore, and are only
> activated when all queues in the list were polled and were determined
> to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
> is incapable of monitoring more than one address.
>
> Also, while we're at it, update and improve the docs.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v6:
> - Track each individual queue sleep status (Konstantin)
> - Fix segfault (Dave)
>
> v5:
> - Remove the "power save queue" API and replace it with mechanism suggested by
> Konstantin
>
> v3:
> - Move the list of supported NICs to NIC feature table
>
> v2:
> - Use a TAILQ for queues instead of a static array
> - Address feedback from Konstantin
> - Add additional checks for stopped queues
>
> doc/guides/nics/features.rst | 10 +
> doc/guides/prog_guide/power_man.rst | 65 ++--
> doc/guides/rel_notes/release_21_08.rst | 3 +
> lib/power/rte_power_pmd_mgmt.c | 452 +++++++++++++++++++------
> 4 files changed, 394 insertions(+), 136 deletions(-)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 403c2b03a3..a96e12d155 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
> * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
> * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
>
> +.. _nic_features_get_monitor_addr:
> +
> +PMD power management using monitor addresses
> +--------------------------------------------
> +
> +Supports getting a monitoring condition to use together with Ethernet PMD power
> +management (see :doc:`../prog_guide/power_man` for more details).
> +
> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
> +
> .. _nic_features_other:
>
> Other dev ops not represented by a Feature
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index c70ae128ac..ec04a72108 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
> Abstract
> ~~~~~~~~
>
> -Existing power management mechanisms require developers
> -to change application design or change code to make use of it.
> -The PMD power management API provides a convenient alternative
> -by utilizing Ethernet PMD RX callbacks,
> -and triggering power saving whenever empty poll count reaches a certain number.
> -
> -Monitor
> - This power saving scheme will put the CPU into optimized power state
> - and use the ``rte_power_monitor()`` function
> - to monitor the Ethernet PMD RX descriptor address,
> - and wake the CPU up whenever there's new traffic.
> -
> -Pause
> - This power saving scheme will avoid busy polling
> - by either entering power-optimized sleep state
> - with ``rte_power_pause()`` function,
> - or, if it's not available, use ``rte_pause()``.
> -
> -Frequency scaling
> - This power saving scheme will use ``librte_power`` library
> - functionality to scale the core frequency up/down
> - depending on traffic volume.
> -
> -.. note::
> -
> - Currently, this power management API is limited to mandatory mapping
> - of 1 queue to 1 core (multiple queues are supported,
> - but they must be polled from different cores).
> +Existing power management mechanisms require developers to change application
> +design or change code to make use of them. The PMD power management API
> +provides a convenient alternative by utilizing Ethernet PMD RX callbacks, and
> +triggering power saving whenever the empty poll count reaches a certain threshold.
> +
> +* Monitor
> + This power saving scheme will put the CPU into optimized power state and
> + monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
> + there's new traffic. Support for this scheme may not be available on all
> + platforms, and further limitations may apply (see below).
> +
> +* Pause
> + This power saving scheme will avoid busy polling by either entering
> + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
> + not supported by the underlying platform, use ``rte_pause()``.
> +
> +* Frequency scaling
> + This power saving scheme will use ``librte_power`` library functionality to
> + scale the core frequency up/down depending on traffic volume.
> +
> +The "monitor" mode is only supported in the following configurations and scenarios:
> +
> +* If the ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
> +  limited to a 1 core to 1 queue mapping (thus, each Rx queue will have to be
> +  monitored from a different lcore).
> +
> +* If the ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor()`` is not supported, then monitor mode will not be
> +  available.
> +
> +* Not all Ethernet drivers support monitoring, even if the underlying
> + platform may support the necessary CPU instructions. Please refer to
> + :doc:`../nics/overview` for more information.
> +
....
> +static inline void
> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> + const bool is_ready_to_sleep = qcfg->n_empty_polls > EMPTYPOLL_MAX;
> +
> + /* reset empty poll counter for this queue */
> + qcfg->n_empty_polls = 0;
> + /* reset the queue sleep counter as well */
> + qcfg->n_sleeps = 0;
> + /* remove the queue from list of cores ready to sleep */
> + if (is_ready_to_sleep)
> + cfg->n_queues_ready_to_sleep--;
> + /*
> + * no need to change the lcore sleep target counter because this lcore will
> + * reach the n_sleeps anyway, and the other cores are already counted so
> + * there's no need to do anything else.
> + */
> +}
> +
> +static inline bool
> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> + /* this function is called - that means we have an empty poll */
> + qcfg->n_empty_polls++;
> +
> + /* if we haven't reached threshold for empty polls, we can't sleep */
> + if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
> + return false;
> +
> + /*
> + * we've reached a point where we are able to sleep, but we still need
> + * to check if this queue has already been marked for sleeping.
> + */
> + if (qcfg->n_sleeps == cfg->sleep_target)
> + return true;
> +
> + /* mark this queue as ready for sleep */
> + qcfg->n_sleeps = cfg->sleep_target;
> + cfg->n_queues_ready_to_sleep++;
So, assuming there is no incoming traffic, should it be:
1) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=1); sleep; poll_all_queues(times=1); sleep; ...
OR
2) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times= EMPTYPOLL_MAX); sleep; poll_all_queues(times= EMPTYPOLL_MAX); sleep; ...
?
My initial thought was 2), but maybe the intention is 1)?
> +
> + return true;
> +}
> +
> +static inline bool
> +lcore_can_sleep(struct pmd_core_cfg *cfg)
> +{
> + /* are all queues ready to sleep? */
> + if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
> + return false;
> +
> + /* we've reached an iteration where we can sleep, reset sleep counter */
> + cfg->n_queues_ready_to_sleep = 0;
> + cfg->sleep_target++;
> +
> + return true;
> +}
> +
> static uint16_t
> clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> - uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> - void *addr __rte_unused)
> + uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
> {
> + struct queue_list_entry *queue_conf = arg;
>
> - struct pmd_queue_cfg *q_conf;
> -
> - q_conf = &port_cfg[port_id][qidx];
> -
> + /* this callback can't do more than one queue, omit multiqueue logic */
> if (unlikely(nb_rx == 0)) {
> - q_conf->empty_poll_stats++;
> - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> + queue_conf->n_empty_polls++;
> + if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
> struct rte_power_monitor_cond pmc;
> - uint16_t ret;
> + int ret;
>
> /* use monitoring condition to sleep */
> ret = rte_eth_get_monitor_addr(port_id, qidx,
> @@ -97,60 +231,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> rte_power_monitor(&pmc, UINT64_MAX);
> }
> } else
> - q_conf->empty_poll_stats = 0;
> + queue_conf->n_empty_polls = 0;
>
> return nb_rx;
> }
>
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-06 18:50 ` Ananyev, Konstantin
@ 2021-07-07 10:06 ` Burakov, Anatoly
2021-07-07 10:11 ` Ananyev, Konstantin
0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-07 10:06 UTC (permalink / raw)
To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara
On 06-Jul-21 7:50 PM, Ananyev, Konstantin wrote:
>
<snip>
> ....
>> +static inline void
>> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> + const bool is_ready_to_sleep = qcfg->n_empty_polls > EMPTYPOLL_MAX;
>> +
>> + /* reset empty poll counter for this queue */
>> + qcfg->n_empty_polls = 0;
>> + /* reset the queue sleep counter as well */
>> + qcfg->n_sleeps = 0;
>> + /* remove the queue from list of cores ready to sleep */
>> + if (is_ready_to_sleep)
>> + cfg->n_queues_ready_to_sleep--;
>> + /*
>> + * no need to change the lcore sleep target counter because this lcore will
>> + * reach the n_sleeps anyway, and the other cores are already counted so
>> + * there's no need to do anything else.
>> + */
>> +}
>> +
>> +static inline bool
>> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> + /* this function is called - that means we have an empty poll */
>> + qcfg->n_empty_polls++;
>> +
>> + /* if we haven't reached threshold for empty polls, we can't sleep */
>> + if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
>> + return false;
>> +
>> + /*
>> + * we've reached a point where we are able to sleep, but we still need
>> + * to check if this queue has already been marked for sleeping.
>> + */
>> + if (qcfg->n_sleeps == cfg->sleep_target)
>> + return true;
>> +
>> + /* mark this queue as ready for sleep */
>> + qcfg->n_sleeps = cfg->sleep_target;
>> + cfg->n_queues_ready_to_sleep++;
>
> So, assuming there is no incoming traffic, should it be:
> 1) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=1); sleep; poll_all_queues(times=1); sleep; ...
> OR
> 2) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times= EMPTYPOLL_MAX); sleep; poll_all_queues(times= EMPTYPOLL_MAX); sleep; ...
> ?
>
> My initial thought was 2) but might be the intention is 1)?
The intent is 1), not 2). There's no need to wait for more empty polls
once we pass the threshold - we keep sleeping until there's traffic.
--
Thanks,
Anatoly
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-07 10:06 ` Burakov, Anatoly
@ 2021-07-07 10:11 ` Ananyev, Konstantin
2021-07-07 11:54 ` Burakov, Anatoly
0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-07-07 10:11 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara
> >
>
> <snip>
>
> > ....
> >> <snip>
> >> + /*
> >> + * we've reached a point where we are able to sleep, but we still need
> >> + * to check if this queue has already been marked for sleeping.
> >> + */
> >> + if (qcfg->n_sleeps == cfg->sleep_target)
> >> + return true;
> >> +
> >> + /* mark this queue as ready for sleep */
> >> + qcfg->n_sleeps = cfg->sleep_target;
> >> + cfg->n_queues_ready_to_sleep++;
> >
> > So, assuming there is no incoming traffic, should it be:
> > 1) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=1); sleep; poll_all_queues(times=1); sleep; ...
> > OR
> > 2) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times= EMPTYPOLL_MAX); sleep; poll_all_queues(times=
> EMPTYPOLL_MAX); sleep; ...
> > ?
> >
> > My initial thought was 2) but might be the intention is 1)?
>
>
> The intent is 1), not 2). There's no need to wait for more empty polls
> once we pass the threshold - we keep sleeping until there's traffic.
>
Ok, then:
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Probably worth putting extra explanation here or in the doc,
to help people avoid wrong assumptions😉
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-07 10:11 ` Ananyev, Konstantin
@ 2021-07-07 11:54 ` Burakov, Anatoly
2021-07-07 12:51 ` Ananyev, Konstantin
0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-07 11:54 UTC (permalink / raw)
To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara
On 07-Jul-21 11:11 AM, Ananyev, Konstantin wrote:
>>>
>>
>> <snip>
>>
>>> ....
>>>> <snip>
>>>> + /*
>>>> + * we've reached a point where we are able to sleep, but we still need
>>>> + * to check if this queue has already been marked for sleeping.
>>>> + */
>>>> + if (qcfg->n_sleeps == cfg->sleep_target)
>>>> + return true;
>>>> +
>>>> + /* mark this queue as ready for sleep */
>>>> + qcfg->n_sleeps = cfg->sleep_target;
>>>> + cfg->n_queues_ready_to_sleep++;
>>>
>>> So, assuming there is no incoming traffic, should it be:
>>> 1) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=1); sleep; poll_all_queues(times=1); sleep; ...
>>> OR
>>> 2) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times= EMPTYPOLL_MAX); sleep; poll_all_queues(times=
>> EMPTYPOLL_MAX); sleep; ...
>>> ?
>>>
>>> My initial thought was 2) but might be the intention is 1)?
>>
>>
>> The intent is 1), not 2). There's no need to wait for more empty polls
>> once we pass the threshold - we keep sleeping until there's traffic.
>>
>
> Ok, then:
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>
> Probably worth putting extra explanation here or in the doc,
> to help people avoid wrong assumptions😉
>
I don't see value in going into such details. What would be the point?
Like, what difference would this information make to anyone?
--
Thanks,
Anatoly
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-07 11:54 ` Burakov, Anatoly
@ 2021-07-07 12:51 ` Ananyev, Konstantin
2021-07-07 14:35 ` Burakov, Anatoly
0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-07-07 12:51 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara
> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Wednesday, July 7, 2021 12:54 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Hunt, David <david.hunt@intel.com>
> Cc: Loftus, Ciara <ciara.loftus@intel.com>
> Subject: Re: [PATCH v6 5/7] power: support callbacks for multiple Rx queues
>
> On 07-Jul-21 11:11 AM, Ananyev, Konstantin wrote:
> <snip>
> >>
> >>
> >> The intent is 1), not 2). There's no need to wait for more empty polls
> >> once we pass the threshold - we keep sleeping until there's traffic.
> >>
> >
> > Ok, then:
> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >
> > Probably worth putting extra explanation here or in the doc,
> > to help people avoid wrong assumptions😉
> >
>
> I don't see value in going into such details. What would be the point?
> Like, what difference would this information make to anyone?
I thought it was obvious: if you put extra explanation into the code,
then it would be easier for anyone who reads it (reviewers/maintainers/users)
to understand what it is supposed to do.
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-07 12:51 ` Ananyev, Konstantin
@ 2021-07-07 14:35 ` Burakov, Anatoly
2021-07-07 17:09 ` Ananyev, Konstantin
0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-07 14:35 UTC (permalink / raw)
To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara
On 07-Jul-21 1:51 PM, Ananyev, Konstantin wrote:
>
>
>> On 07-Jul-21 11:11 AM, Ananyev, Konstantin wrote:
>> <snip>
>>>> The intent is 1), not 2). There's no need to wait for more empty polls
>>>> once we pass the threshold - we keep sleeping until there's traffic.
>>>>
>>>
>>> Ok, then:
>>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>>>
>>> Probably worth putting extra explanation here or in the doc,
>>> to help people avoid wrong assumptions😉
>>>
>>
>> I don't see value in going into such details. What would be the point?
>> Like, what difference would this information make to anyone?
>
> I thought it is obvious: if you put extra explanation into the code,
> then it would be easier for anyone who reads it (reviewers/maintainers/users)
> to understand what it supposed to do.
>
You're suggesting to put this *in the doc*, which implies that *the
user* will find this information useful. I'm OK with adding this info as
a comment somewhere perhaps, but why put it in the doc?
--
Thanks,
Anatoly
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-07 14:35 ` Burakov, Anatoly
@ 2021-07-07 17:09 ` Ananyev, Konstantin
0 siblings, 0 replies; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-07-07 17:09 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara
> >>>>>
> >>>>>> Currently, there is a hard limitation on the PMD power management
> >>>>>> support that only allows it to support a single queue per lcore. This is
> >>>>>> not ideal as most DPDK use cases will poll multiple queues per core.
> >>>>>>
> >>>>>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> >>>>>> is very difficult to implement such support because callbacks are
> >>>>>> effectively stateless and have no visibility into what the other ethdev
> >>>>>> devices are doing. This places limitations on what we can do within the
> >>>>>> framework of Rx callbacks, but the basics of this implementation are as
> >>>>>> follows:
> >>>>>>
> >>>>>> - Replace per-queue structures with per-lcore ones, so that any device
> >>>>>> polled from the same lcore can share data
> >>>>>> - Any queue that is going to be polled from a specific lcore has to be
> >>>>>> added to the list of queues to poll, so that the callback is aware of
> >>>>>> other queues being polled by the same lcore
> >>>>>> - Both the empty poll counter and the actual power saving mechanism are
> >>>>>> shared between all queues polled on a particular lcore, and is only
> >>>>>> activated when all queues in the list were polled and were determined
> >>>>>> to have no traffic.
> >>>>>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
> >>>>>> is incapable of monitoring more than one address.
> >>>>>>
> >>>>>> Also, while we're at it, update and improve the docs.
> >>>>>>
> >>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>> ---
> >>>>>>
> >>>>>> Notes:
> >>>>>> v6:
> >>>>>> - Track each individual queue sleep status (Konstantin)
> >>>>>> - Fix segfault (Dave)
> >>>>>>
> >>>>>> v5:
> >>>>>> - Remove the "power save queue" API and replace it with mechanism suggested by
> >>>>>> Konstantin
> >>>>>>
> >>>>>> v3:
> >>>>>> - Move the list of supported NICs to NIC feature table
> >>>>>>
> >>>>>> v2:
> >>>>>> - Use a TAILQ for queues instead of a static array
> >>>>>> - Address feedback from Konstantin
> >>>>>> - Add additional checks for stopped queues
> >>>>>>
> >>>>
> >>>> <snip>
> >>>>
> >>>>> ....
> >>>>>> +static inline void
> >>>>>> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> >>>>>> +{
> >>>>>> + const bool is_ready_to_sleep = qcfg->n_empty_polls > EMPTYPOLL_MAX;
> >>>>>> +
> >>>>>> + /* reset empty poll counter for this queue */
> >>>>>> + qcfg->n_empty_polls = 0;
> >>>>>> + /* reset the queue sleep counter as well */
> >>>>>> + qcfg->n_sleeps = 0;
> >>>>>> + /* remove the queue from list of cores ready to sleep */
> >>>>>> + if (is_ready_to_sleep)
> >>>>>> + cfg->n_queues_ready_to_sleep--;
> >>>>>> + /*
> >>>>>> + * no need change the lcore sleep target counter because this lcore will
> >>>>>> + * reach the n_sleeps anyway, and the other cores are already counted so
> >>>>>> + * there's no need to do anything else.
> >>>>>> + */
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline bool
> >>>>>> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> >>>>>> +{
> >>>>>> + /* this function is called - that means we have an empty poll */
> >>>>>> + qcfg->n_empty_polls++;
> >>>>>> +
> >>>>>> + /* if we haven't reached threshold for empty polls, we can't sleep */
> >>>>>> + if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
> >>>>>> + return false;
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * we've reached a point where we are able to sleep, but we still need
> >>>>>> + * to check if this queue has already been marked for sleeping.
> >>>>>> + */
> >>>>>> + if (qcfg->n_sleeps == cfg->sleep_target)
> >>>>>> + return true;
> >>>>>> +
> >>>>>> + /* mark this queue as ready for sleep */
> >>>>>> + qcfg->n_sleeps = cfg->sleep_target;
> >>>>>> + cfg->n_queues_ready_to_sleep++;
> >>>>>
> >>>>> So, assuming there is no incoming traffic, should it be:
> >>>>> 1) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times=1); sleep; poll_all_queues(times=1); sleep; ...
> >>>>> OR
> >>>>> 2) poll_all_queues(times=EMPTYPOLL_MAX); sleep; poll_all_queues(times= EMPTYPOLL_MAX); sleep; poll_all_queues(times=
> >>>> EMPTYPOLL_MAX); sleep; ...
> >>>>> ?
> >>>>>
> >>>>> My initial thought was 2) but might be the intention is 1)?
> >>>>
> >>>>
> >>>> The intent is 1), not 2). There's no need to wait for more empty polls
> >>>> once we pass the threshold - we keep sleeping until there's traffic.
> >>>>
> >>>
> >>> Ok, then:
> >>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >>>
> >>> Probably worth putting extra explanation here or in the doc,
> >>> to help people avoid wrong assumptions😉
> >>>
> >>
> >> I don't see value in going into such details. What would be the point?
> >> Like, what difference would this information make to anyone?
> >
> > I thought it is obvious: if you put extra explanation into the code,
> > then it would be easier for anyone who reads it (reviewers/maintainers/users)
> > to understand what it is supposed to do.
> >
>
> You're suggesting to put this *in the doc*, which implies that *the
> user* will find this information useful. I'm OK with adding this info as
> a comment somewhere perhaps, but why put it in the doc?
I don't really mind where you'll put it - either extra comments in that file,
or a few lines in the doc - both are OK with me.
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
2021-07-06 18:50 ` Ananyev, Konstantin
@ 2021-07-07 10:04 ` David Hunt
2021-07-07 10:28 ` Burakov, Anatoly
1 sibling, 1 reply; 165+ messages in thread
From: David Hunt @ 2021-07-07 10:04 UTC (permalink / raw)
To: Anatoly Burakov, dev; +Cc: ciara.loftus, konstantin.ananyev
On 5/7/2021 4:22 PM, Anatoly Burakov wrote:
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
>
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
>
> - Replace per-queue structures with per-lcore ones, so that any device
> polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
> added to the list of queues to poll, so that the callback is aware of
> other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism are
> shared between all queues polled on a particular lcore, and is only
> activated when all queues in the list were polled and were determined
> to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
> is incapable of monitoring more than one address.
>
> Also, while we're at it, update and improve the docs.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v6:
> - Track each individual queue sleep status (Konstantin)
> - Fix segfault (Dave)
>
> v5:
> - Remove the "power save queue" API and replace it with mechanism suggested by
> Konstantin
>
> v3:
> - Move the list of supported NICs to NIC feature table
>
> v2:
> - Use a TAILQ for queues instead of a static array
> - Address feedback from Konstantin
> - Add additional checks for stopped queues
>
> doc/guides/nics/features.rst | 10 +
> doc/guides/prog_guide/power_man.rst | 65 ++--
> doc/guides/rel_notes/release_21_08.rst | 3 +
> lib/power/rte_power_pmd_mgmt.c | 452 +++++++++++++++++++------
> 4 files changed, 394 insertions(+), 136 deletions(-)
>
--snip--
>
> +static inline void
> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> + const bool is_ready_to_sleep = qcfg->n_empty_polls > EMPTYPOLL_MAX;
> +
> + /* reset empty poll counter for this queue */
> + qcfg->n_empty_polls = 0;
> + /* reset the queue sleep counter as well */
> + qcfg->n_sleeps = 0;
> + /* remove the queue from list of cores ready to sleep */
> + if (is_ready_to_sleep)
> + cfg->n_queues_ready_to_sleep--;
Hi Anatoly,
I don't think the logic around this is bulletproof yet; in my
testing I'm seeing n_queues_ready_to_sleep wrap around (i.e. decremented
while already zero).
Rgds,
Dave.
--snip--
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
2021-07-07 10:04 ` David Hunt
@ 2021-07-07 10:28 ` Burakov, Anatoly
0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-07 10:28 UTC (permalink / raw)
To: David Hunt, dev; +Cc: ciara.loftus, konstantin.ananyev
On 07-Jul-21 11:04 AM, David Hunt wrote:
>
> On 5/7/2021 4:22 PM, Anatoly Burakov wrote:
>> Currently, there is a hard limitation on the PMD power management
>> support that only allows it to support a single queue per lcore. This is
>> not ideal as most DPDK use cases will poll multiple queues per core.
>>
>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>> is very difficult to implement such support because callbacks are
>> effectively stateless and have no visibility into what the other ethdev
>> devices are doing. This places limitations on what we can do within the
>> framework of Rx callbacks, but the basics of this implementation are as
>> follows:
>>
>> - Replace per-queue structures with per-lcore ones, so that any device
>> polled from the same lcore can share data
>> - Any queue that is going to be polled from a specific lcore has to be
>> added to the list of queues to poll, so that the callback is aware of
>> other queues being polled by the same lcore
>> - Both the empty poll counter and the actual power saving mechanism are
>> shared between all queues polled on a particular lcore, and is only
>> activated when all queues in the list were polled and were determined
>> to have no traffic.
>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>> is incapable of monitoring more than one address.
>>
>> Also, while we're at it, update and improve the docs.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>> v6:
>> - Track each individual queue sleep status (Konstantin)
>> - Fix segfault (Dave)
>> v5:
>> - Remove the "power save queue" API and replace it with mechanism
>> suggested by
>> Konstantin
>> v3:
>> - Move the list of supported NICs to NIC feature table
>> v2:
>> - Use a TAILQ for queues instead of a static array
>> - Address feedback from Konstantin
>> - Add additional checks for stopped queues
>>
>> doc/guides/nics/features.rst | 10 +
>> doc/guides/prog_guide/power_man.rst | 65 ++--
>> doc/guides/rel_notes/release_21_08.rst | 3 +
>> lib/power/rte_power_pmd_mgmt.c | 452 +++++++++++++++++++------
>> 4 files changed, 394 insertions(+), 136 deletions(-)
>>
>
> --snip--
>
>
>> +static inline void
>> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> + const bool is_ready_to_sleep = qcfg->n_empty_polls > EMPTYPOLL_MAX;
>> +
>> + /* reset empty poll counter for this queue */
>> + qcfg->n_empty_polls = 0;
>> + /* reset the queue sleep counter as well */
>> + qcfg->n_sleeps = 0;
>> + /* remove the queue from list of cores ready to sleep */
>> + if (is_ready_to_sleep)
>> + cfg->n_queues_ready_to_sleep--;
>
>
> Hi Anatoly,
>
> I don't think the logic around this is bulletproof yet; in my
> testing I'm seeing n_queues_ready_to_sleep wrap around (i.e. decremented
> while already zero).
>
> Rgds,
> Dave.
>
>
> --snip--
Thanks for your testing!
It seems that the number of empty polls is not a reliable indicator of
whether the queue is ready to sleep: if we get a non-empty poll right
after a sleep, the empty poll counter will still be at a high value,
which will cause n_queues_ready_to_sleep to be decremented even though
it's already at zero, because we just had a sleep.
Using n_sleeps and sleep_target is better in this case. I'll submit a v7
with this fix. Thanks!
--
Thanks,
Anatoly
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v6 6/7] power: support monitoring multiple Rx queues
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
` (4 preceding siblings ...)
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-05 15:22 ` Anatoly Burakov
2021-07-07 10:16 ` Ananyev, Konstantin
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:22 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v6:
- Fix the missed feedback from v5
v4:
- Fix possible out of bounds access
- Added missing index increment
doc/guides/prog_guide/power_man.rst | 9 ++--
lib/power/rte_power_pmd_mgmt.c | 82 ++++++++++++++++++++++++++++-
2 files changed, 86 insertions(+), 5 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index ec04a72108..94353ca012 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
The "monitor" mode is only supported in the following configurations and scenarios:
* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor_multi()`` function is supported by the platform, then
+ monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
``rte_power_monitor()`` is supported by the platform, then monitoring will be
limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
monitored from a different lcore).
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
- ``rte_power_monitor()`` function is not supported, then monitor mode will not
- be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+ two monitoring functions are supported, then monitor mode will not be supported.
* Not all Ethernet drivers support monitoring, even if the underlying
platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9ffeda05ed..0c45469619 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -126,6 +126,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
return found;
}
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+ struct rte_power_monitor_cond *pmc, size_t len)
+{
+ const struct queue_list_entry *qle;
+ size_t i = 0;
+ int ret;
+
+ TAILQ_FOREACH(qle, &cfg->head, next) {
+ const union queue *q = &qle->queue;
+ struct rte_power_monitor_cond *cur;
+
+ /* attempted out of bounds access */
+ if (i >= len) {
+ RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+ return -1;
+ }
+
+ cur = &pmc[i++];
+ ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
static void
calc_tsc(void)
{
@@ -211,6 +237,46 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
return true;
}
+static uint16_t
+clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
+{
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
+
+ lcore_conf = &lcore_cfgs[lcore];
+
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ struct rte_power_monitor_cond pmc[lcore_conf->n_queues];
+ int ret;
+
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* gather all monitoring conditions */
+ ret = get_monitor_addresses(lcore_conf, pmc,
+ lcore_conf->n_queues);
+ if (ret < 0)
+ return nb_rx;
+
+ rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
+ }
+
+ return nb_rx;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
@@ -362,14 +428,19 @@ static int
check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
{
struct rte_power_monitor_cond dummy;
+ bool multimonitor_supported;
/* check if rte_power_monitor is supported */
if (!global_data.intrinsics_support.power_monitor) {
RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
return -ENOTSUP;
}
+ /* check if multi-monitor is supported */
+ multimonitor_supported =
+ global_data.intrinsics_support.power_monitor_multi;
- if (cfg->n_queues > 0) {
+ /* if we're adding a new queue, do we support multiple queues? */
+ if (cfg->n_queues > 0 && !multimonitor_supported) {
RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
return -ENOTSUP;
}
@@ -385,6 +456,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
return 0;
}
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+ return global_data.intrinsics_support.power_monitor_multi ?
+ clb_multiwait : clb_umwait;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -449,7 +527,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (ret < 0)
goto end;
- clb = clb_umwait;
+ clb = get_monitor_callback();
break;
case RTE_POWER_MGMT_TYPE_SCALE:
/* check if we can add a new queue */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v6 6/7] power: support monitoring multiple Rx queues
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 6/7] power: support monitoring " Anatoly Burakov
@ 2021-07-07 10:16 ` Ananyev, Konstantin
0 siblings, 0 replies; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-07-07 10:16 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara
>
> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
> Rx queues while entering the energy efficient power state. The multi
> version will be used unconditionally if supported, and the UMWAIT one
> will only be used when multi-monitor is not supported by the hardware.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> --
> 2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v6 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
` (5 preceding siblings ...)
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 6/7] power: support monitoring " Anatoly Burakov
@ 2021-07-05 15:22 ` Anatoly Burakov
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:22 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
examples/l3fwd-power/main.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..52f56dc405 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2723,12 +2723,6 @@ main(int argc, char **argv)
printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
fflush(stdout);
- /* PMD power management mode can only do 1 queue per core */
- if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
- rte_exit(EXIT_FAILURE,
- "In PMD power management mode, only one queue per lcore is allowed\n");
- }
-
/* init RX queues */
for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
struct rte_eth_rxconf rxq_conf;
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management
2021-07-05 15:21 ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
` (6 preceding siblings ...)
2021-07-05 15:22 ` [dpdk-dev] [PATCH v6 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-07-07 10:48 ` Anatoly Burakov
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
` (7 more replies)
7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-07 10:48 UTC (permalink / raw)
To: dev; +Cc: konstantin.ananyev, ciara.loftus, david.hunt
This patchset introduces several changes related to PMD power management:
- Changed monitoring intrinsics to use callbacks as a comparison function, based
on previous patchset [1] but incorporating feedback [2] - this hopefully will
make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
accompanying infrastructure and example apps changes
v7:
- Fixed various bugs
v6:
- Improved the algorithm for multi-queue sleep
- Fixed segfault and addressed other feedback
v5:
- Removed "power save queue" API and replaced with mechanism suggested by
Konstantin
- Addressed other feedback
v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections
v3:
- Moved some doc updates to NIC features list
v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary
[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274
Anatoly Burakov (7):
power_intrinsics: use callbacks for comparison
net/af_xdp: add power monitor support
eal: add power monitor for multiple events
power: remove thread safety from PMD power API's
power: support callbacks for multiple Rx queues
power: support monitoring multiple Rx queues
l3fwd-power: support multiqueue in PMD pmgmt modes
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 74 +-
doc/guides/rel_notes/release_21_08.rst | 9 +
drivers/event/dlb2/dlb2.c | 17 +-
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +
drivers/net/i40e/i40e_rxtx.c | 20 +-
drivers/net/iavf/iavf_rxtx.c | 20 +-
drivers/net/ice/ice_rxtx.c | 20 +-
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +-
drivers/net/mlx5/mlx5_rx.c | 17 +-
examples/l3fwd-power/main.c | 6 -
lib/eal/arm/rte_power_intrinsics.c | 11 +
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 68 +-
lib/eal/ppc/rte_power_intrinsics.c | 11 +
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 90 ++-
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 659 +++++++++++++-----
lib/power/rte_power_pmd_mgmt.h | 6 +
21 files changed, 840 insertions(+), 262 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v7 1/7] power_intrinsics: use callbacks for comparison
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-07-07 10:48 ` Anatoly Burakov
2021-07-07 11:56 ` David Hunt
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 2/7] net/af_xdp: add power monitor support Anatoly Burakov
` (6 subsequent siblings)
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-07 10:48 UTC (permalink / raw)
To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: ciara.loftus, david.hunt
Previously, the semantics of power monitor were such that we were
checking current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.
This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.
Existing implementations are adjusted to follow the new semantics.
Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
Notes:
v4:
- Return error if callback is set to NULL
- Replace raw number with a macro in monitor condition opaque data
v2:
- Use callback mechanism for more flexibility
- Address feedback from Konstantin
doc/guides/rel_notes/release_21_08.rst | 2 ++
drivers/event/dlb2/dlb2.c | 17 ++++++++--
drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
drivers/net/ice/ice_rxtx.c | 20 +++++++----
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
.../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
9 files changed, 122 insertions(+), 44 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index cd02820e68..c1d063bb11 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -117,6 +117,8 @@ API Changes
* eal: ``rte_strscpy`` sets ``rte_errno`` to ``E2BIG`` in case of string
truncation.
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+
ABI Changes
-----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
}
}
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ /* abort if the value matches */
+ return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
static inline int
dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
expected_value = 0;
pmc.addr = monitor_addr;
- pmc.val = expected_value;
- pmc.mask = qe_mask.raw_qe[1];
+ /* store expected value and comparison mask in opaque data */
+ pmc.opaque[CLB_VAL_IDX] = expected_value;
+ pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+ /* set up callback */
+ pmc.fn = dlb2_monitor_callback;
pmc.size = sizeof(uint64_t);
rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 8d65f287f4..65f325ede1 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
#define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
+static int
+i40e_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = i40e_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index f817fbc49b..d61b32fcee 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
}
+static int
+iavf_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = iavf_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 3f6e735984..5d7ab4f047 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
+static int
+ice_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.status_error0;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
- pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /* comparison callback */
+ pmc->fn = ice_monitor_callback;
/* register is 16-bit */
pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
};
+static int
+ixgbe_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.upper.status_error;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
- pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /* comparison callback */
+ pmc->fn = ixgbe_monitor_callback;
/* the registers are 32-bit */
pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..17370b77dc 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return rx_queue_count(rxq);
}
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t m = opaque[CLB_MSK_IDX];
+ const uint64_t v = opaque[CLB_VAL_IDX];
+
+ return (value & m) == v ? -1 : 0;
+}
+
int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -rte_errno;
}
pmc->addr = &cqe->op_own;
- pmc->val = !!idx;
- pmc->mask = MLX5_CQE_OWNER_MASK;
+ pmc->opaque[CLB_VAL_IDX] = !!idx;
+ pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+ pmc->fn = mlx_monitor_callback;
pmc->size = sizeof(uint8_t);
return 0;
}
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
* which are architecture-dependent.
*/
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ * The value read from memory.
+ * @param opaque
+ * Callback-specific data.
+ *
+ * @return
+ * 0 if entering of power optimized state should proceed
+ * -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
struct rte_power_monitor_cond {
volatile void *addr; /**< Address to monitor for changes */
- uint64_t val; /**< If the `mask` is non-zero, location pointed
- * to by `addr` will be read and compared
- * against this value.
- */
- uint64_t mask; /**< 64-bit mask to extract value read from `addr` */
- uint8_t size; /**< Data size (in bytes) that will be used to compare
- * expected value (`val`) with data read from the
+ uint8_t size; /**< Data size (in bytes) that will be read from the
* monitored memory location (`addr`). Can be 1, 2,
* 4, or 8. Supplying any other value will result in
* an error.
*/
+ rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+ * entering power optimized state should
+ * be aborted.
+ */
+ uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+ /**< Callback-specific data */
};
/**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
const unsigned int lcore_id = rte_lcore_id();
struct power_wait_status *s;
+ uint64_t cur_value;
/* prevent user from running this instruction if it's not supported */
if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
if (__check_val_size(pmc->size) < 0)
return -EINVAL;
+ if (pmc->fn == NULL)
+ return -EINVAL;
+
s = &wait_status[lcore_id];
/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
/* now that we've put this address into monitor, we can unlock */
rte_spinlock_unlock(&s->lock);
- /* if we have a comparison mask, we might not need to sleep at all */
- if (pmc->mask) {
- const uint64_t cur_value = __get_umwait_val(
- pmc->addr, pmc->size);
- const uint64_t masked = cur_value & pmc->mask;
+ cur_value = __get_umwait_val(pmc->addr, pmc->size);
- /* if the masked value is already matching, abort */
- if (masked == pmc->val)
- goto end;
- }
+ /* check if callback indicates we should abort */
+ if (pmc->fn(cur_value, pmc->opaque) != 0)
+ goto end;
/* execute UMWAIT */
asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v7 1/7] power_intrinsics: use callbacks for comparison
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-07-07 11:56 ` David Hunt
0 siblings, 0 replies; 165+ messages in thread
From: David Hunt @ 2021-07-07 11:56 UTC (permalink / raw)
To: Anatoly Burakov, dev, Timothy McDaniel, Beilei Xing, Jingjing Wu,
Qiming Yang, Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: ciara.loftus
On 7/7/2021 11:48 AM, Anatoly Burakov wrote:
> Previously, the semantics of power monitor were such that we were
> checking current value against the expected value, and if they matched,
> then the sleep was aborted. This is somewhat inflexible, because it only
> allowed us to check for a specific value in a specific way.
>
> This commit replaces the comparison with a user callback mechanism, so
> that any PMD (or other code) using `rte_power_monitor()` can define
> their own comparison semantics and decision making on how to detect the
> need to abort the entering of power optimized state.
>
> Existing implementations are adjusted to follow the new semantics.
>
> Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>
> Notes:
> v4:
> - Return error if callback is set to NULL
> - Replace raw number with a macro in monitor condition opaque data
>
> v2:
> - Use callback mechanism for more flexibility
> - Address feedback from Konstantin
>
>
Tested-by: David Hunt <david.hunt@intel.com>
* [dpdk-dev] [PATCH v7 2/7] net/af_xdp: add power monitor support
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-07-07 10:48 ` Anatoly Burakov
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 3/7] eal: add power monitor for multiple events Anatoly Burakov
` (5 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-07 10:48 UTC (permalink / raw)
To: dev, Ciara Loftus, Qi Zhang; +Cc: konstantin.ananyev, david.hunt
Implement support for .get_monitor_addr in AF_XDP driver.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Rewrite using the callback mechanism
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..7830d0c23a 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
#include <rte_malloc.h>
#include <rte_ring.h>
#include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
#include "compat.h"
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t v = opaque[CLB_VAL_IDX];
+ const uint64_t m = (uint32_t)~0;
+
+ /* if the value has changed, abort entering power optimized state */
+ return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+ struct pkt_rx_queue *rxq = rx_queue;
+ unsigned int *prod = rxq->rx.producer;
+ const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+ /* watch for changes in producer ring */
+ pmc->addr = (void*)prod;
+
+ /* store current value */
+ pmc->opaque[CLB_VAL_IDX] = cur_val;
+ pmc->fn = eth_monitor_callback;
+
+ /* AF_XDP producer ring index is 32-bit */
+ pmc->size = sizeof(uint32_t);
+
+ return 0;
+}
+
static int
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
{
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.stats_get = eth_stats_get,
.stats_reset = eth_stats_reset,
+ .get_monitor_addr = eth_get_monitor_addr
};
/** parse busy_budget argument */
--
2.25.1
* [dpdk-dev] [PATCH v7 3/7] eal: add power monitor for multiple events
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-07-07 10:48 ` Anatoly Burakov
2021-07-07 12:01 ` David Hunt
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
` (4 subsequent siblings)
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-07 10:48 UTC (permalink / raw)
To: dev, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
Cc: ciara.loftus, david.hunt
Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v4:
- Fixed bugs in accessing the monitor condition
- Abort on any monitor condition not having a defined callback
v2:
- Adapt to callback mechanism
lib/eal/arm/rte_power_intrinsics.c | 11 +++
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 35 +++++++++
lib/eal/ppc/rte_power_intrinsics.c | 11 +++
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++
7 files changed, 137 insertions(+)
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
/**< indicates support for rte_power_monitor function */
uint32_t power_pause : 1;
/**< indicates support for rte_power_pause function */
+ uint32_t power_monitor_multi : 1;
+ /**< indicates support for rte_power_monitor_multi function */
};
/**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
__rte_experimental
int rte_power_pause(const uint64_t tsc_timestamp);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each monitoring condition carries a comparison callback and
+ * callback-specific opaque data. For every condition, the current value at the
+ * monitored address will be read and passed to the callback, and if the
+ * callback indicates so, the entering of optimized power state may be aborted.
+ *
+ * @warning It is responsibility of the user to check if this function is
+ * supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ * Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ * An array of monitoring condition structures.
+ * @param num
+ * Length of the `pmc` array.
+ * @param tsc_timestamp
+ * Maximum TSC timestamp to wait for. Note that the wait behavior is
+ * architecture-dependent.
+ *
+ * @return
+ * 0 on success
+ * -EINVAL on invalid parameters
+ * -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp);
+
#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
rte_version_release; # WINDOWS_NO_EXPORT
rte_version_suffix; # WINDOWS_NO_EXPORT
rte_version_year; # WINDOWS_NO_EXPORT
+
+ # added in 21.08
+ rte_power_monitor_multi; # WINDOWS_NO_EXPORT
};
INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
intrinsics->power_monitor = 1;
intrinsics->power_pause = 1;
+ if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+ intrinsics->power_monitor_multi = 1;
}
}
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
#include <rte_common.h>
#include <rte_lcore.h>
+#include <rte_rtm.h>
#include <rte_spinlock.h>
#include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
}
static bool wait_supported;
+static bool wait_multi_supported;
static inline uint64_t
__get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
if (i.power_monitor && i.power_pause)
wait_supported = 1;
+ if (i.power_monitor_multi)
+ wait_multi_supported = 1;
}
int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
* In this case, since we've already woken up, the "wakeup" was
* unneeded, and since T1 is still waiting on T2 releasing the lock, the
* wakeup address is still valid so it's perfectly safe to write it.
+ *
+ * For multi-monitor case, the act of locking will in itself trigger the
+ * wakeup, so no additional writes necessary.
*/
rte_spinlock_lock(&s->lock);
if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return 0;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ const unsigned int lcore_id = rte_lcore_id();
+ struct power_wait_status *s = &wait_status[lcore_id];
+ uint32_t i, rc;
+
+ /* check if supported */
+ if (!wait_multi_supported)
+ return -ENOTSUP;
+
+ if (pmc == NULL || num == 0)
+ return -EINVAL;
+
+ /* we are already inside transaction region, return */
+ if (rte_xtest() != 0)
+ return 0;
+
+ /* start new transaction region */
+ rc = rte_xbegin();
+
+ /* transaction abort, possible write to one of wait addresses */
+ if (rc != RTE_XBEGIN_STARTED)
+ return 0;
+
+ /*
+ * the mere act of reading the lock status here adds the lock to
+ * the read set. This means that when we trigger a wakeup from another
+ * thread, even if we don't have a defined wakeup address and thus don't
+ * actually cause any writes, the act of locking our lock will itself
+ * trigger the wakeup and abort the transaction.
+ */
+ rte_spinlock_is_locked(&s->lock);
+
+ /*
+ * add all addresses to wait on into transaction read-set and check if
+ * any of wakeup conditions are already met.
+ */
+ rc = 0;
+ for (i = 0; i < num; i++) {
+ const struct rte_power_monitor_cond *c = &pmc[i];
+
+ /* cannot be NULL */
+ if (c->fn == NULL) {
+ rc = -EINVAL;
+ break;
+ }
+
+ const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+ /* abort if callback indicates that we need to stop */
+ if (c->fn(val, c->opaque) != 0)
+ break;
+ }
+
+ /* none of the conditions were met, sleep until timeout */
+ if (i == num)
+ rte_power_pause(tsc_timestamp);
+
+ /* end transaction region */
+ rte_xend();
+
+ return rc;
+}
--
2.25.1
* Re: [dpdk-dev] [PATCH v7 3/7] eal: add power monitor for multiple events
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-07-07 12:01 ` David Hunt
0 siblings, 0 replies; 165+ messages in thread
From: David Hunt @ 2021-07-07 12:01 UTC (permalink / raw)
To: Anatoly Burakov, dev, Jan Viktorin, Ruifeng Wang, Jerin Jacob,
David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson,
Konstantin Ananyev
Cc: ciara.loftus
On 7/7/2021 11:48 AM, Anatoly Burakov wrote:
> Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
> what UMWAIT does, but without the limitation of having to listen for
> just one event. This works because the optimized power state used by the
> TPAUSE instruction will cause a wake up on RTM transaction abort, so if
> we add the addresses we're interested in to the read-set, any write to
> those addresses will wake us up.
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v4:
> - Fixed bugs in accessing the monitor condition
> - Abort on any monitor condition not having a defined callback
>
> v2:
> - Adapt to callback mechanism
>
Initially I had issues running this as I couldn't see the "rtm" flag in
/proc/cpuinfo, but adding "tsx=on" to the kernel parameters resolved
this, and rtm then became available to use. Once this was done, I was
able to see power saving when a core was configured with
multiple queues.
Tested-by: David Hunt <david.hunt@intel.com>
* [dpdk-dev] [PATCH v7 4/7] power: remove thread safety from PMD power API's
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
` (2 preceding siblings ...)
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-07-07 10:48 ` Anatoly Burakov
2021-07-07 12:02 ` David Hunt
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
` (3 subsequent siblings)
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-07 10:48 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.
We could have used something like an RCU for this use case, but absent
a pressing need for thread safety we'll go the easy way and just
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Add check for stopped queue
- Clarified doc message
- Added release notes
doc/guides/rel_notes/release_21_08.rst | 4 +
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 133 ++++++++++---------------
lib/power/rte_power_pmd_mgmt.h | 6 ++
4 files changed, 66 insertions(+), 80 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c1d063bb11..4b84c89c0b 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -119,6 +119,10 @@ API Changes
* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+* rte_power: The experimental PMD power management API is no longer considered
+ to be thread safe; all Rx queues affected by the API will now need to be
+ stopped before making any changes to the power management scheme.
+
ABI Changes
-----------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
'rte_power_pmd_mgmt.h',
'rte_power_guest_channel.h',
)
+if cc.has_argument('-Wno-cast-qual')
+ cflags += '-Wno-cast-qual'
+endif
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
/**< Callback mode for this queue */
const struct rte_eth_rxtx_callback *cur_cb;
/**< Callback instance */
- volatile bool umwait_in_progress;
- /**< are we currently sleeping? */
uint64_t empty_poll_stats;
/**< Number of empty polls */
} __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
struct rte_power_monitor_cond pmc;
uint16_t ret;
- /*
- * we might get a cancellation request while being
- * inside the callback, in which case the wakeup
- * wouldn't work because it would've arrived too early.
- *
- * to get around this, we notify the other thread that
- * we're sleeping, so that it can spin until we're done.
- * unsolicited wakeups are perfectly safe.
- */
- q_conf->umwait_in_progress = true;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- /* check if we need to cancel sleep */
- if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
- /* use monitoring condition to sleep */
- ret = rte_eth_get_monitor_addr(port_id, qidx,
- &pmc);
- if (ret == 0)
- rte_power_monitor(&pmc, UINT64_MAX);
- }
- q_conf->umwait_in_progress = false;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+ /* use monitoring condition to sleep */
+ ret = rte_eth_get_monitor_addr(port_id, qidx,
+ &pmc);
+ if (ret == 0)
+ rte_power_monitor(&pmc, UINT64_MAX);
}
} else
q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
return nb_rx;
}
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+ struct rte_eth_rxq_info qinfo;
+
+ if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+ return -1;
+
+ return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
struct pmd_queue_cfg *queue_cfg;
struct rte_eth_dev_info info;
+ rte_rx_callback_fn clb;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
queue_cfg = &port_cfg[port_id][queue_id];
if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->umwait_in_progress = false;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* ensure we update our state before callback starts */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_umwait, NULL);
+ clb = clb_umwait;
break;
}
case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
- queue_id, clb_scale_freq, NULL);
+ clb = clb_scale_freq;
break;
}
case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (global_data.tsc_per_us == 0)
calc_tsc();
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_pause, NULL);
+ clb = clb_pause;
break;
+ default:
+ RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+ ret = -EINVAL;
+ goto end;
}
+
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, NULL);
+
ret = 0;
end:
return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
struct pmd_queue_cfg *queue_cfg;
+ int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
return -EINVAL;
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
/* no need to check queue id as wrong queue id would not be enabled */
queue_cfg = &port_cfg[port_id][queue_id];
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
/* stop any callbacks from progressing */
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
- /* ensure we update our state before continuing */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
switch (queue_cfg->cb_mode) {
- case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- bool exit = false;
- do {
- /*
- * we may request cancellation while the other thread
- * has just entered the callback but hasn't started
- * sleeping yet, so keep waking it up until we know it's
- * done sleeping.
- */
- if (queue_cfg->umwait_in_progress)
- rte_power_monitor_wakeup(lcore_id);
- else
- exit = true;
- } while (!exit);
- }
- /* fall-through */
+ case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
rte_eth_remove_rx_callback(port_id, queue_id,
queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
break;
}
/*
- * we don't free the RX callback here because it is unsafe to do so
- * unless we know for a fact that all data plane threads have stopped.
+ * the API doc mandates that the user stops all processing on affected
+ * ports before calling any of these API's, so we can assume that the
+ * callbacks can be freed. we're intentionally casting away const-ness.
*/
- queue_cfg->cur_cb = NULL;
+ rte_free((void *)queue_cfg->cur_cb);
return 0;
}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue will be polled from.
* @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue is polled from.
* @param port_id
--
2.25.1
* Re: [dpdk-dev] [PATCH v7 4/7] power: remove thread safety from PMD power API's
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-07-07 12:02 ` David Hunt
0 siblings, 0 replies; 165+ messages in thread
From: David Hunt @ 2021-07-07 12:02 UTC (permalink / raw)
To: Anatoly Burakov, dev; +Cc: konstantin.ananyev, ciara.loftus
On 7/7/2021 11:48 AM, Anatoly Burakov wrote:
> Currently, we expect that only one callback can be active at any given
> moment, for a particular queue configuration, which is relatively easy
> to implement in a thread-safe way. However, we're about to add support
> for multiple queues per lcore, which will greatly increase the
> possibility of various race conditions.
>
> We could have used something like an RCU for this use case, but absent
> a pressing need for thread safety we'll go the easy way and just
> mandate that the APIs are to be called when all affected ports are
> stopped, and document this limitation. This greatly simplifies the
> `rte_power_monitor`-related code.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v2:
> - Add check for stopped queue
> - Clarified doc message
> - Added release notes
>
Tested-by: David Hunt <david.hunt@intel.com>
* [dpdk-dev] [PATCH v7 5/7] power: support callbacks for multiple Rx queues
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
` (3 preceding siblings ...)
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-07-07 10:48 ` Anatoly Burakov
2021-07-07 11:54 ` David Hunt
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 6/7] power: support monitoring " Anatoly Burakov
` (2 subsequent siblings)
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-07 10:48 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.
The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:
- Replace per-queue structures with per-lcore ones, so that any device
polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
added to the list of queues to poll, so that the callback is aware of
other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
shared between all queues polled on a particular lcore, and are only
activated when all queues in the list have been polled and determined
to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
is incapable of monitoring more than one address.
Also, while we're at it, update and improve the docs.
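The shared-counter scheme in the bullet points above can be modeled in a few lines of plain C. This is an illustrative sketch only - the names, the threshold value, and the fixed-size queue array are stand-ins, not the actual `rte_power_pmd_mgmt.c` implementation that appears in the diff further down. Note that `sleep_target` must start at 1, matching the v7 fix for the initial sleep target:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative threshold; DPDK uses its own EMPTYPOLL_MAX. */
#define EMPTY_POLL_THRESHOLD 4

struct q_state {
    uint64_t n_empty_polls; /* consecutive empty polls on this queue */
    uint64_t n_sleeps;      /* sleep "round" this queue has reached */
};

struct lcore_state {
    struct q_state q[8];    /* fixed array stands in for the TAILQ */
    unsigned int n_queues;
    uint64_t n_ready;       /* queues that reached the current round */
    uint64_t sleep_target;  /* current round; init to 1, never 0 */
};

/* Record one poll result for a queue; return true only when every queue
 * on the lcore has independently seen enough empty polls. */
static bool lcore_poll(struct lcore_state *lc, unsigned int qid,
                       unsigned int nb_rx)
{
    struct q_state *q = &lc->q[qid];

    if (nb_rx != 0) {
        /* traffic seen: reset this queue, drop it from the ready set */
        if (q->n_sleeps == lc->sleep_target)
            lc->n_ready--;
        q->n_empty_polls = 0;
        q->n_sleeps = 0;
        return false;
    }
    if (++q->n_empty_polls <= EMPTY_POLL_THRESHOLD)
        return false;
    /* mark this queue ready for the current round, exactly once */
    if (q->n_sleeps != lc->sleep_target) {
        q->n_sleeps = lc->sleep_target;
        lc->n_ready++;
    }
    /* sleep only when all queues on the lcore are ready */
    if (lc->n_ready != lc->n_queues)
        return false;
    lc->n_ready = 0;
    lc->sleep_target++; /* start a new round */
    return true;
}
```

The round counter is what prevents one persistently idle queue from triggering the sleep repeatedly while another queue still carries traffic: a queue counts toward `n_ready` at most once per round.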
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v7:
- Fix bug where initial sleep target was always set to zero
- Fix logic in handling of n_queues_ready_to_sleep counter
- Update documentation on hardware requirements
v6:
- Track each individual queue sleep status (Konstantin)
- Fix segfault (Dave)
v5:
- Remove the "power save queue" API and replace it with mechanism suggested by
Konstantin
v3:
- Move the list of supported NICs to NIC feature table
v2:
- Use a TAILQ for queues instead of a static array
- Address feedback from Konstantin
- Add additional checks for stopped queues
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 69 ++--
doc/guides/rel_notes/release_21_08.rst | 3 +
lib/power/rte_power_pmd_mgmt.c | 456 +++++++++++++++++++------
4 files changed, 402 insertions(+), 136 deletions(-)
diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
* **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
* **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
.. _nic_features_other:
Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..0e66878892 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,45 @@ Ethernet PMD Power Management API
Abstract
~~~~~~~~
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
- This power saving scheme will put the CPU into optimized power state
- and use the ``rte_power_monitor()`` function
- to monitor the Ethernet PMD RX descriptor address,
- and wake the CPU up whenever there's new traffic.
-
-Pause
- This power saving scheme will avoid busy polling
- by either entering power-optimized sleep state
- with ``rte_power_pause()`` function,
- or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
- This power saving scheme will use ``librte_power`` library
- functionality to scale the core frequency up/down
- depending on traffic volume.
-
-.. note::
-
- Currently, this power management API is limited to mandatory mapping
- of 1 queue to 1 core (multiple queues are supported,
- but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides
+a convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever the empty poll count reaches a certain threshold.
+
+* Monitor
+ This power saving scheme will put the CPU into optimized power state and
+ monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+ there's new traffic. Support for this scheme may not be available on all
+ platforms, and further limitations may apply (see below).
+
+* Pause
+ This power saving scheme will avoid busy polling by either entering
+ power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+ not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+ This power saving scheme will use ``librte_power`` library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* On Linux* x86_64, `rte_power_monitor()` requires WAITPKG instruction set being
+ supported by the CPU. Please refer to your platform documentation for further
+ information.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+ limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
+ monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+ ``rte_power_monitor()`` function is not supported, then monitor mode will not
+ be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+ platform may support the necessary CPU instructions. Please refer to
+ :doc:`../nics/overview` for more information.
+
API Overview for Ethernet PMD Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +253,5 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section.
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 4b84c89c0b..fce50f0cd6 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -85,6 +85,9 @@ New Features
usecases. Configuration happens via standard rawdev enq/deq operations. See
the :doc:`../rawdevs/cnxk_bphy` rawdev guide for more details on this driver.
+* rte_power: The experimental PMD power management API now supports managing
+ multiple Ethernet Rx queues per lcore.
+
Removed Items
-------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..ceaf386d2b 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,98 @@ enum pmd_mgmt_state {
PMD_MGMT_ENABLED
};
-struct pmd_queue_cfg {
+union queue {
+ uint32_t val;
+ struct {
+ uint16_t portid;
+ uint16_t qid;
+ };
+};
+
+struct queue_list_entry {
+ TAILQ_ENTRY(queue_list_entry) next;
+ union queue queue;
+ uint64_t n_empty_polls;
+ uint64_t n_sleeps;
+ const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+ TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+ /**< List of queues associated with this lcore */
+ size_t n_queues;
+ /**< How many queues are in the list? */
volatile enum pmd_mgmt_state pwr_mgmt_state;
/**< State of power management for this queue */
enum rte_power_pmd_mgmt_type cb_mode;
/**< Callback mode for this queue */
- const struct rte_eth_rxtx_callback *cur_cb;
- /**< Callback instance */
- uint64_t empty_poll_stats;
- /**< Number of empty polls */
+ uint64_t n_queues_ready_to_sleep;
+ /**< Number of queues ready to enter power optimized state */
+ uint64_t sleep_target;
+ /**< Prevent a queue from triggering sleep multiple times */
} __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+ return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+ dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *cur;
+
+ TAILQ_FOREACH(cur, &cfg->head, next) {
+ if (queue_equal(&cur->queue, q))
+ return cur;
+ }
+ return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *qle;
+
+ /* is it already in the list? */
+ if (queue_list_find(cfg, q) != NULL)
+ return -EEXIST;
+
+ qle = malloc(sizeof(*qle));
+ if (qle == NULL)
+ return -ENOMEM;
+ memset(qle, 0, sizeof(*qle));
+
+ queue_copy(&qle->queue, q);
+ TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+ cfg->n_queues++;
+
+ return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *found;
+
+ found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return NULL;
+
+ TAILQ_REMOVE(&cfg->head, found, next);
+ cfg->n_queues--;
+
+ /* freeing is responsibility of the caller */
+ return found;
+}
static void
calc_tsc(void)
@@ -74,21 +154,75 @@ calc_tsc(void)
}
}
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ const bool is_ready_to_sleep = qcfg->n_sleeps == cfg->sleep_target;
+
+ /* reset empty poll counter for this queue */
+ qcfg->n_empty_polls = 0;
+ /* reset the queue sleep counter as well */
+ qcfg->n_sleeps = 0;
+ /* remove the queue from list of queues ready to sleep */
+ if (is_ready_to_sleep)
+ cfg->n_queues_ready_to_sleep--;
+ /*
+ * no need to change the lcore sleep target counter because this lcore
+ * will reach the n_sleeps anyway, and the other cores are already
+ * counted, so there's no need to do anything else.
+ */
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ /* this function is called - that means we have an empty poll */
+ qcfg->n_empty_polls++;
+
+ /* if we haven't reached threshold for empty polls, we can't sleep */
+ if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+ return false;
+
+ /*
+ * we've reached a point where we are able to sleep, but we still need
+ * to check if this queue has already been marked for sleeping.
+ */
+ if (qcfg->n_sleeps == cfg->sleep_target)
+ return true;
+
+ /* mark this queue as ready for sleep */
+ qcfg->n_sleeps = cfg->sleep_target;
+ cfg->n_queues_ready_to_sleep++;
+
+ return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+ /* are all queues ready to sleep? */
+ if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+ return false;
+
+ /* we've reached an iteration where we can sleep, reset sleep counter */
+ cfg->n_queues_ready_to_sleep = 0;
+ cfg->sleep_target++;
+
+ return true;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+ uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
{
+ struct queue_list_entry *queue_conf = arg;
- struct pmd_queue_cfg *q_conf;
-
- q_conf = &port_cfg[port_id][qidx];
-
+ /* this callback can't do more than one queue, omit multiqueue logic */
if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+ queue_conf->n_empty_polls++;
+ if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
struct rte_power_monitor_cond pmc;
- uint16_t ret;
+ int ret;
/* use monitoring condition to sleep */
ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +231,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
rte_power_monitor(&pmc, UINT64_MAX);
}
} else
- q_conf->empty_poll_stats = 0;
+ queue_conf->n_empty_polls = 0;
return nb_rx;
}
static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
- q_conf = &port_cfg[port_id][qidx];
+ lcore_conf = &lcore_cfgs[lcore];
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- /* sleep for 1 microsecond */
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
- /* use tpause if we have it */
- if (global_data.intrinsics_support.power_pause) {
- const uint64_t cur = rte_rdtsc();
- const uint64_t wait_tsc =
- cur + global_data.tsc_per_us;
- rte_power_pause(wait_tsc);
- } else {
- uint64_t i;
- for (i = 0; i < global_data.pause_per_us; i++)
- rte_pause();
- }
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* sleep for 1 microsecond, use tpause if we have it */
+ if (global_data.intrinsics_support.power_pause) {
+ const uint64_t cur = rte_rdtsc();
+ const uint64_t wait_tsc =
+ cur + global_data.tsc_per_us;
+ rte_power_pause(wait_tsc);
+ } else {
+ uint64_t i;
+ for (i = 0; i < global_data.pause_per_us; i++)
+ rte_pause();
}
- } else
- q_conf->empty_poll_stats = 0;
+ }
return nb_rx;
}
static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
- uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+ struct queue_list_entry *queue_conf = arg;
- q_conf = &port_cfg[port_id][qidx];
+ if (likely(!empty)) {
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
- /* scale down freq */
- rte_power_freq_min(rte_lcore_id());
- } else {
- q_conf->empty_poll_stats = 0;
- /* scale up freq */
+ /* scale up freq immediately */
rte_power_freq_max(rte_lcore_id());
+ } else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ rte_power_freq_min(rte_lcore_id());
}
return nb_rx;
@@ -167,11 +318,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
}
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+ const struct queue_list_entry *entry;
+
+ TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+ const union queue *q = &entry->queue;
+ int ret = queue_stopped(q->portid, q->qid);
+ if (ret != 1)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+ enum power_management_env env;
+
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ return -ENOTSUP;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore))
+ return -EINVAL;
+
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+ struct rte_power_monitor_cond dummy;
+
+ /* check if rte_power_monitor is supported */
+ if (!global_data.intrinsics_support.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ return -ENOTSUP;
+ }
+
+ if (cfg->n_queues > 0) {
+ RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+ return -ENOTSUP;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+ &dummy) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
struct rte_eth_dev_info info;
rte_rx_callback_fn clb;
int ret;
@@ -202,9 +422,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
+ /* if callback was already enabled, check current callback type */
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+ lcore_cfg->cb_mode != mode) {
ret = -EINVAL;
goto end;
}
@@ -214,53 +444,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
switch (mode) {
case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- struct rte_power_monitor_cond dummy;
-
- /* check if rte_power_monitor is supported */
- if (!global_data.intrinsics_support.power_monitor) {
- RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_monitor(lcore_cfg, &qdata);
+ if (ret < 0)
goto end;
- }
- /* check if the device supports the necessary PMD API */
- if (rte_eth_get_monitor_addr(port_id, queue_id,
- &dummy) == -ENOTSUP) {
- RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_umwait;
break;
- }
case RTE_POWER_MGMT_TYPE_SCALE:
- {
- enum power_management_env env;
- /* only PSTATE and ACPI modes are supported */
- if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
- !rte_power_check_env_supported(
- PM_ENV_PSTATE_CPUFREQ)) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_scale(lcore_id);
+ if (ret < 0)
goto end;
- }
- /* ensure we could initialize the power library */
- if (rte_power_init(lcore_id)) {
- ret = -EINVAL;
- goto end;
- }
- /* ensure we initialized the correct env */
- env = rte_power_get_env();
- if (env != PM_ENV_ACPI_CPUFREQ &&
- env != PM_ENV_PSTATE_CPUFREQ) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_scale_freq;
break;
- }
case RTE_POWER_MGMT_TYPE_PAUSE:
/* figure out various time-to-tsc conversions */
if (global_data.tsc_per_us == 0)
@@ -273,13 +470,27 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -EINVAL;
goto end;
}
+ /* add this queue to the list */
+ ret = queue_list_add(lcore_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+ strerror(-ret));
+ goto end;
+ }
+ /* new queue is always added last */
+ queue_cfg = TAILQ_LAST(&lcore_cfg->head, queue_list_head);
+
+ /* when enabling first queue, ensure sleep target is not 0 */
+ if (lcore_cfg->n_queues == 1 && lcore_cfg->sleep_target == 0)
+ lcore_cfg->sleep_target = 1;
/* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb, NULL);
+ if (lcore_cfg->n_queues == 1) {
+ lcore_cfg->cb_mode = mode;
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ }
+ queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, queue_cfg);
ret = 0;
end:
@@ -290,7 +501,9 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +519,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
}
/* no need to check queue id as wrong queue id would not be enabled */
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
return -EINVAL;
- /* stop any callbacks from progressing */
- queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ /*
+ * There is no good/easy way to do this without race conditions, so we
+ * are just going to throw our hands in the air and hope that the user
+ * has read the documentation and has ensured that ports are stopped at
+ * the time we enter the API functions.
+ */
+ queue_cfg = queue_list_take(lcore_cfg, &qdata);
+ if (queue_cfg == NULL)
+ return -ENOENT;
- switch (queue_cfg->cb_mode) {
+ /* if we've removed all queues from the lists, set state to disabled */
+ if (lcore_cfg->n_queues == 0)
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+ switch (lcore_cfg->cb_mode) {
case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
break;
case RTE_POWER_MGMT_TYPE_SCALE:
rte_power_freq_max(lcore_id);
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
rte_power_exit(lcore_id);
break;
}
@@ -332,7 +561,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
* ports before calling any of these API's, so we can assume that the
* callbacks can be freed. we're intentionally casting away const-ness.
*/
- rte_free((void *)queue_cfg->cur_cb);
+ rte_free((void *)queue_cfg->cb);
+ free(queue_cfg);
return 0;
}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+ size_t i;
+
+ /* initialize all tailqs */
+ for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+ struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+ TAILQ_INIT(&cfg->head);
+ }
+}
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v7 5/7] power: support callbacks for multiple Rx queues
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-07 11:54 ` David Hunt
0 siblings, 0 replies; 165+ messages in thread
From: David Hunt @ 2021-07-07 11:54 UTC (permalink / raw)
To: Anatoly Burakov, dev; +Cc: konstantin.ananyev, ciara.loftus
On 7/7/2021 11:48 AM, Anatoly Burakov wrote:
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
>
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
>
> - Replace per-queue structures with per-lcore ones, so that any device
> polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
> added to the list of queues to poll, so that the callback is aware of
> other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism are
> shared between all queues polled on a particular lcore, and are only
> activated when all queues in the list have been polled and determined
> to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
> is incapable of monitoring more than one address.
>
> Also, while we're at it, update and improve the docs.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v7:
> - Fix bug where initial sleep target was always set to zero
> - Fix logic in handling of n_queues_ready_to_sleep counter
> - Update documentation on hardware requirements
>
> v6:
> - Track each individual queue sleep status (Konstantin)
> - Fix segfault (Dave)
>
> v5:
> - Remove the "power save queue" API and replace it with mechanism suggested by
> Konstantin
>
> v3:
> - Move the list of supported NICs to NIC feature table
>
> v2:
> - Use a TAILQ for queues instead of a static array
> - Address feedback from Konstantin
> - Add additional checks for stopped queues
>
> doc/guides/nics/features.rst | 10 +
> doc/guides/prog_guide/power_man.rst | 69 ++--
> doc/guides/rel_notes/release_21_08.rst | 3 +
> lib/power/rte_power_pmd_mgmt.c | 456 +++++++++++++++++++------
> 4 files changed, 402 insertions(+), 136 deletions(-)
>
--snip--
Thanks Anatoly. I'm not seeing the rollover now, and power savings are back
as expected under low traffic in both monitor and pause modes. All
previous issues now seem to be resolved.
Patch set LGTM.
Tested-by: David Hunt <david.hunt@intel.com>
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v7 6/7] power: support monitoring multiple Rx queues
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
` (4 preceding siblings ...)
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-07 10:48 ` Anatoly Burakov
2021-07-07 12:03 ` David Hunt
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-07 10:48 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.
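The selection policy described above - prefer the multi-monitor callback when the hardware supports it, fall back to single-address UMWAIT otherwise, and refuse monitor mode if neither is available - amounts to a small capability check. A plain-C sketch follows; the struct and function names are hypothetical stand-ins mirroring the patch's `get_monitor_callback()` helper and the `check_monitor()` gating, not DPDK's actual types:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability flags, standing in for
 * rte_cpu_intrinsics.power_monitor / power_monitor_multi. */
struct intrinsics_caps {
    bool power_monitor;       /* single-address UMWAIT available */
    bool power_monitor_multi; /* multi-address monitor available */
};

typedef int (*rx_callback_t)(void);

static int cb_umwait(void)    { return 1; } /* one queue per lcore only */
static int cb_multiwait(void) { return 2; } /* many queues per lcore */

/* Use the multi-monitor callback unconditionally when supported,
 * fall back to the single-address one, or return NULL when monitor
 * mode cannot be offered at all. */
static rx_callback_t
select_monitor_callback(const struct intrinsics_caps *c)
{
    if (c->power_monitor_multi)
        return cb_multiwait;
    if (c->power_monitor)
        return cb_umwait;
    return NULL;
}
```

Keeping the decision in one helper means the enable path stays identical for both cases; only the callback installed by `rte_eth_add_rx_callback()` differs.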
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v6:
- Fix the missed feedback from v5
v4:
- Fix possible out of bounds access
- Added missing index increment
doc/guides/prog_guide/power_man.rst | 15 ++++--
lib/power/rte_power_pmd_mgmt.c | 82 ++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 7 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0e66878892..e387d7811e 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,17 +221,22 @@ power saving whenever empty poll count reaches a certain number.
The "monitor" mode is only supported in the following configurations and scenarios:
* On Linux* x86_64, `rte_power_monitor()` requires WAITPKG instruction set being
- supported by the CPU. Please refer to your platform documentation for further
- information.
+ supported by the CPU, while `rte_power_monitor_multi()` requires WAITPKG and
+ RTM instruction sets being supported by the CPU. The RTM instruction set may
+ also require booting Linux with the `tsx=on` command line parameter. Please
+ refer to your platform documentation for further information.
* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor_multi()`` function is supported by the platform, then
+ monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
``rte_power_monitor()`` is supported by the platform, then monitoring will be
limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
monitored from a different lcore).
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
- ``rte_power_monitor()`` function is not supported, then monitor mode will not
- be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+ two monitoring functions is supported, then monitor mode will not be supported.
* Not all Ethernet drivers support monitoring, even if the underlying
platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index ceaf386d2b..ba5971f827 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -126,6 +126,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
return found;
}
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+ struct rte_power_monitor_cond *pmc, size_t len)
+{
+ const struct queue_list_entry *qle;
+ size_t i = 0;
+ int ret;
+
+ TAILQ_FOREACH(qle, &cfg->head, next) {
+ const union queue *q = &qle->queue;
+ struct rte_power_monitor_cond *cur;
+
+ /* attempted out of bounds access */
+ if (i >= len) {
+ RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+ return -1;
+ }
+
+ cur = &pmc[i++];
+ ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
static void
calc_tsc(void)
{
@@ -211,6 +237,46 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
return true;
}
+static uint16_t
+clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
+{
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
+
+ lcore_conf = &lcore_cfgs[lcore];
+
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ struct rte_power_monitor_cond pmc[lcore_conf->n_queues];
+ int ret;
+
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* gather all monitoring conditions */
+ ret = get_monitor_addresses(lcore_conf, pmc,
+ lcore_conf->n_queues);
+ if (ret < 0)
+ return nb_rx;
+
+ rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
+ }
+
+ return nb_rx;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
@@ -362,14 +428,19 @@ static int
check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
{
struct rte_power_monitor_cond dummy;
+ bool multimonitor_supported;
/* check if rte_power_monitor is supported */
if (!global_data.intrinsics_support.power_monitor) {
RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
return -ENOTSUP;
}
+ /* check if multi-monitor is supported */
+ multimonitor_supported =
+ global_data.intrinsics_support.power_monitor_multi;
- if (cfg->n_queues > 0) {
+ /* if we're adding a new queue, do we support multiple queues? */
+ if (cfg->n_queues > 0 && !multimonitor_supported) {
RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
return -ENOTSUP;
}
@@ -385,6 +456,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
return 0;
}
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+ return global_data.intrinsics_support.power_monitor_multi ?
+ clb_multiwait : clb_umwait;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -449,7 +527,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (ret < 0)
goto end;
- clb = clb_umwait;
+ clb = get_monitor_callback();
break;
case RTE_POWER_MGMT_TYPE_SCALE:
/* check if we can add a new queue */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v7 6/7] power: support monitoring multiple Rx queues
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 6/7] power: support monitoring " Anatoly Burakov
@ 2021-07-07 12:03 ` David Hunt
0 siblings, 0 replies; 165+ messages in thread
From: David Hunt @ 2021-07-07 12:03 UTC (permalink / raw)
To: Anatoly Burakov, dev; +Cc: konstantin.ananyev, ciara.loftus
On 7/7/2021 11:48 AM, Anatoly Burakov wrote:
> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
> Rx queues while entering the energy efficient power state. The multi
> version will be used unconditionally if supported, and the UMWAIT one
> will only be used when multi-monitor is not supported by the hardware.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
> v6:
> - Fix the missed feedback from v5
>
> v4:
> - Fix possible out of bounds access
> - Added missing index increment
>
>
--snip--
Tested-by: David Hunt <david.hunt@intel.com>
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v7 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
` (5 preceding siblings ...)
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 6/7] power: support monitoring " Anatoly Burakov
@ 2021-07-07 10:48 ` Anatoly Burakov
2021-07-07 12:03 ` David Hunt
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-07 10:48 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
examples/l3fwd-power/main.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..52f56dc405 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2723,12 +2723,6 @@ main(int argc, char **argv)
printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
fflush(stdout);
- /* PMD power management mode can only do 1 queue per core */
- if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
- rte_exit(EXIT_FAILURE,
- "In PMD power management mode, only one queue per lcore is allowed\n");
- }
-
/* init RX queues */
for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
struct rte_eth_rxconf rxq_conf;
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v7 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-07-07 12:03 ` David Hunt
0 siblings, 0 replies; 165+ messages in thread
From: David Hunt @ 2021-07-07 12:03 UTC (permalink / raw)
To: Anatoly Burakov, dev; +Cc: konstantin.ananyev, ciara.loftus
On 7/7/2021 11:48 AM, Anatoly Burakov wrote:
> Currently, l3fwd-power enforces the limitation of having one queue per
> lcore. This is no longer necessary, so remove the limitation.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> examples/l3fwd-power/main.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index f8dfed1634..52f56dc405 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -2723,12 +2723,6 @@ main(int argc, char **argv)
> printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
> fflush(stdout);
>
> - /* PMD power management mode can only do 1 queue per core */
> - if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
> - rte_exit(EXIT_FAILURE,
> - "In PMD power management mode, only one queue per lcore is allowed\n");
> - }
> -
> /* init RX queues */
> for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
> struct rte_eth_rxconf rxq_conf;
Tested-by: David Hunt <david.hunt@intel.com>
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 0/7] Enhancements for PMD power management Anatoly Burakov
` (6 preceding siblings ...)
2021-07-07 10:48 ` [dpdk-dev] [PATCH v7 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-07-08 14:13 ` Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
` (7 more replies)
7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-08 14:13 UTC (permalink / raw)
To: dev; +Cc: ciara.loftus, david.hunt, konstantin.ananyev
This patchset introduces several changes related to PMD power management:
- Changed monitoring intrinsics to use callbacks as a comparison function, based
on previous patchset [1] but incorporating feedback [2] - this hopefully will
make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
accompanying infrastructure and example apps changes
v8:
- Fixed checkpatch issue
- Added comment explaining empty poll handling (Konstantin)
v7:
- Fixed various bugs
v6:
- Improved the algorithm for multi-queue sleep
- Fixed segfault and addressed other feedback
v5:
- Removed "power save queue" API and replaced with mechanism suggested by
Konstantin
- Addressed other feedback
v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections
v3:
- Moved some doc updates to NIC features list
v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary
[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274
Anatoly Burakov (7):
power_intrinsics: use callbacks for comparison
net/af_xdp: add power monitor support
eal: add power monitor for multiple events
power: remove thread safety from PMD power API's
power: support callbacks for multiple Rx queues
power: support monitoring multiple Rx queues
l3fwd-power: support multiqueue in PMD pmgmt modes
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 74 +-
doc/guides/rel_notes/release_21_08.rst | 9 +
drivers/event/dlb2/dlb2.c | 17 +-
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +
drivers/net/i40e/i40e_rxtx.c | 20 +-
drivers/net/iavf/iavf_rxtx.c | 20 +-
drivers/net/ice/ice_rxtx.c | 20 +-
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +-
drivers/net/mlx5/mlx5_rx.c | 17 +-
examples/l3fwd-power/main.c | 6 -
lib/eal/arm/rte_power_intrinsics.c | 11 +
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 68 +-
lib/eal/ppc/rte_power_intrinsics.c | 11 +
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 90 ++-
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 663 +++++++++++++-----
lib/power/rte_power_pmd_mgmt.h | 6 +
21 files changed, 844 insertions(+), 262 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-07-08 14:13 ` Anatoly Burakov
2021-07-08 16:56 ` McDaniel, Timothy
2021-07-09 13:46 ` Thomas Monjalon
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 2/7] net/af_xdp: add power monitor support Anatoly Burakov
` (6 subsequent siblings)
7 siblings, 2 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-08 14:13 UTC (permalink / raw)
To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: ciara.loftus, david.hunt
Previously, the semantics of power monitor were such that we were
checking the current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.
This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
its own comparison semantics and decide how to detect the need to abort
entering the power optimized state.
Existing implementations are adjusted to follow the new semantics.
Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
Notes:
v4:
- Return error if callback is set to NULL
- Replace raw number with a macro in monitor condition opaque data
v2:
- Use callback mechanism for more flexibility
- Address feedback from Konstantin
doc/guides/rel_notes/release_21_08.rst | 2 ++
drivers/event/dlb2/dlb2.c | 17 ++++++++--
drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
drivers/net/ice/ice_rxtx.c | 20 +++++++----
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
.../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
9 files changed, 122 insertions(+), 44 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c92e016783..65910de348 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -135,6 +135,8 @@ API Changes
* eal: ``rte_strscpy`` sets ``rte_errno`` to ``E2BIG`` in case of string
truncation.
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+
ABI Changes
-----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
}
}
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ /* abort if the value matches */
+ return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
static inline int
dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
expected_value = 0;
pmc.addr = monitor_addr;
- pmc.val = expected_value;
- pmc.mask = qe_mask.raw_qe[1];
+ /* store expected value and comparison mask in opaque data */
+ pmc.opaque[CLB_VAL_IDX] = expected_value;
+ pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+ /* set up callback */
+ pmc.fn = dlb2_monitor_callback;
pmc.size = sizeof(uint64_t);
rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 8d65f287f4..65f325ede1 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
#define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
+static int
+i40e_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = i40e_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index f817fbc49b..d61b32fcee 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
}
+static int
+iavf_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = iavf_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 3f6e735984..5d7ab4f047 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
+static int
+ice_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.status_error0;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
- pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /* comparison callback */
+ pmc->fn = ice_monitor_callback;
/* register is 16-bit */
pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
};
+static int
+ixgbe_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.upper.status_error;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
- pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /* comparison callback */
+ pmc->fn = ixgbe_monitor_callback;
/* the registers are 32-bit */
pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..17370b77dc 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return rx_queue_count(rxq);
}
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t m = opaque[CLB_MSK_IDX];
+ const uint64_t v = opaque[CLB_VAL_IDX];
+
+ return (value & m) == v ? -1 : 0;
+}
+
int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -rte_errno;
}
pmc->addr = &cqe->op_own;
- pmc->val = !!idx;
- pmc->mask = MLX5_CQE_OWNER_MASK;
+ pmc->opaque[CLB_VAL_IDX] = !!idx;
+ pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+ pmc->fn = mlx_monitor_callback;
pmc->size = sizeof(uint8_t);
return 0;
}
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
* which are architecture-dependent.
*/
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ * The value read from memory.
+ * @param opaque
+ * Callback-specific data.
+ *
+ * @return
+ * 0 if entering of power optimized state should proceed
+ * -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
struct rte_power_monitor_cond {
volatile void *addr; /**< Address to monitor for changes */
- uint64_t val; /**< If the `mask` is non-zero, location pointed
- * to by `addr` will be read and compared
- * against this value.
- */
- uint64_t mask; /**< 64-bit mask to extract value read from `addr` */
- uint8_t size; /**< Data size (in bytes) that will be used to compare
- * expected value (`val`) with data read from the
+ uint8_t size; /**< Data size (in bytes) that will be read from the
* monitored memory location (`addr`). Can be 1, 2,
* 4, or 8. Supplying any other value will result in
* an error.
*/
+ rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+ * entering power optimized state should
+ * be aborted.
+ */
+ uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+ /**< Callback-specific data */
};
/**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
const unsigned int lcore_id = rte_lcore_id();
struct power_wait_status *s;
+ uint64_t cur_value;
/* prevent user from running this instruction if it's not supported */
if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
if (__check_val_size(pmc->size) < 0)
return -EINVAL;
+ if (pmc->fn == NULL)
+ return -EINVAL;
+
s = &wait_status[lcore_id];
/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
/* now that we've put this address into monitor, we can unlock */
rte_spinlock_unlock(&s->lock);
- /* if we have a comparison mask, we might not need to sleep at all */
- if (pmc->mask) {
- const uint64_t cur_value = __get_umwait_val(
- pmc->addr, pmc->size);
- const uint64_t masked = cur_value & pmc->mask;
+ cur_value = __get_umwait_val(pmc->addr, pmc->size);
- /* if the masked value is already matching, abort */
- if (masked == pmc->val)
- goto end;
- }
+ /* check if callback indicates we should abort */
+ if (pmc->fn(cur_value, pmc->opaque) != 0)
+ goto end;
/* execute UMWAIT */
asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-07-08 16:56 ` McDaniel, Timothy
2021-07-09 13:46 ` Thomas Monjalon
1 sibling, 0 replies; 165+ messages in thread
From: McDaniel, Timothy @ 2021-07-08 16:56 UTC (permalink / raw)
To: Burakov, Anatoly, dev, Xing, Beilei, Wu, Jingjing, Yang, Qiming,
Zhang, Qi Z, Wang, Haiyue, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Richardson, Bruce, Ananyev, Konstantin
Cc: Loftus, Ciara, Hunt, David
> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Thursday, July 8, 2021 9:14 AM
> To: dev@dpdk.org; McDaniel, Timothy <timothy.mcdaniel@intel.com>; Xing,
> Beilei <beilei.xing@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Yang,
> Qiming <qiming.yang@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>; Wang,
> Haiyue <haiyue.wang@intel.com>; Matan Azrad <matan@nvidia.com>; Shahaf
> Shuler <shahafs@nvidia.com>; Viacheslav Ovsiienko <viacheslavo@nvidia.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Cc: Loftus, Ciara <ciara.loftus@intel.com>; Hunt, David <david.hunt@intel.com>
> Subject: [PATCH v8 1/7] power_intrinsics: use callbacks for comparison
>
> Previously, the semantics of power monitor were such that we were
> checking current value against the expected value, and if they matched,
> then the sleep was aborted. This is somewhat inflexible, because it only
> allowed us to check for a specific value in a specific way.
>
> This commit replaces the comparison with a user callback mechanism, so
> that any PMD (or other code) using `rte_power_monitor()` can define
> their own comparison semantics and decision making on how to detect the
> need to abort the entering of power optimized state.
>
> Existing implementations are adjusted to follow the new semantics.
>
> Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>
> Notes:
> v4:
> - Return error if callback is set to NULL
> - Replace raw number with a macro in monitor condition opaque data
>
> v2:
> - Use callback mechanism for more flexibility
> - Address feedback from Konstantin
>
> doc/guides/rel_notes/release_21_08.rst | 2 ++
> drivers/event/dlb2/dlb2.c | 17 ++++++++--
> drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
> drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
> drivers/net/ice/ice_rxtx.c | 20 +++++++----
> drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
> drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
> .../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
> lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
> 9 files changed, 122 insertions(+), 44 deletions(-)
>
> diff --git a/doc/guides/rel_notes/release_21_08.rst
> b/doc/guides/rel_notes/release_21_08.rst
> index c92e016783..65910de348 100644
> --- a/doc/guides/rel_notes/release_21_08.rst
> +++ b/doc/guides/rel_notes/release_21_08.rst
> @@ -135,6 +135,8 @@ API Changes
> * eal: ``rte_strscpy`` sets ``rte_errno`` to ``E2BIG`` in case of string
> truncation.
>
> +* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
> +
>
> ABI Changes
> -----------
> diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
> index eca183753f..252bbd8d5e 100644
> --- a/drivers/event/dlb2/dlb2.c
> +++ b/drivers/event/dlb2/dlb2.c
> @@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port,
> int num)
> }
> }
>
> +#define CLB_MASK_IDX 0
> +#define CLB_VAL_IDX 1
> +static int
> +dlb2_monitor_callback(const uint64_t val,
> + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
> +{
> + /* abort if the value matches */
> + return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 :
> 0;
> +}
> +
> static inline int
> dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
> struct dlb2_eventdev_port *ev_port,
> @@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
> expected_value = 0;
>
> pmc.addr = monitor_addr;
> - pmc.val = expected_value;
> - pmc.mask = qe_mask.raw_qe[1];
> + /* store expected value and comparison mask in opaque data */
> + pmc.opaque[CLB_VAL_IDX] = expected_value;
> + pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
> + /* set up callback */
> + pmc.fn = dlb2_monitor_callback;
> pmc.size = sizeof(uint64_t);
>
> rte_power_monitor(&pmc, timeout + start_ticks);
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index 8d65f287f4..65f325ede1 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -81,6 +81,18 @@
> #define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
> (PKT_TX_OFFLOAD_MASK ^
> I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
>
> +static int
> +i40e_monitor_callback(const uint64_t value,
> + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ]
> __rte_unused)
> +{
> + const uint64_t m = rte_cpu_to_le_64(1 <<
> I40E_RX_DESC_STATUS_DD_SHIFT);
> + /*
> + * we expect the DD bit to be set to 1 if this descriptor was already
> + * written to.
> + */
> + return (value & m) == m ? -1 : 0;
> +}
> +
> int
> i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond
> *pmc)
> {
> @@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct
> rte_power_monitor_cond *pmc)
> /* watch for changes in status bit */
> pmc->addr = &rxdp->wb.qword1.status_error_len;
>
> - /*
> - * we expect the DD bit to be set to 1 if this descriptor was already
> - * written to.
> - */
> - pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> - pmc->mask = rte_cpu_to_le_64(1 <<
> I40E_RX_DESC_STATUS_DD_SHIFT);
> + /* comparison callback */
> + pmc->fn = i40e_monitor_callback;
>
> /* registers are 64-bit */
> pmc->size = sizeof(uint64_t);
> diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
> index f817fbc49b..d61b32fcee 100644
> --- a/drivers/net/iavf/iavf_rxtx.c
> +++ b/drivers/net/iavf/iavf_rxtx.c
> @@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
> rxdid_map[flex_type] :
> IAVF_RXDID_COMMS_OVS_1;
> }
>
> +static int
> +iavf_monitor_callback(const uint64_t value,
> + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ]
> __rte_unused)
> +{
> + const uint64_t m = rte_cpu_to_le_64(1 <<
> IAVF_RX_DESC_STATUS_DD_SHIFT);
> + /*
> + * we expect the DD bit to be set to 1 if this descriptor was already
> + * written to.
> + */
> + return (value & m) == m ? -1 : 0;
> +}
> +
> int
> iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
> {
> @@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct
> rte_power_monitor_cond *pmc)
> /* watch for changes in status bit */
> pmc->addr = &rxdp->wb.qword1.status_error_len;
>
> - /*
> - * we expect the DD bit to be set to 1 if this descriptor was already
> - * written to.
> - */
> - pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
> - pmc->mask = rte_cpu_to_le_64(1 <<
> IAVF_RX_DESC_STATUS_DD_SHIFT);
> + /* comparison callback */
> + pmc->fn = iavf_monitor_callback;
>
> /* registers are 64-bit */
> pmc->size = sizeof(uint64_t);
> diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
> index 3f6e735984..5d7ab4f047 100644
> --- a/drivers/net/ice/ice_rxtx.c
> +++ b/drivers/net/ice/ice_rxtx.c
> @@ -27,6 +27,18 @@ uint64_t
> rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
> uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
> uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
>
> +static int
> +ice_monitor_callback(const uint64_t value,
> + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ]
> __rte_unused)
> +{
> + const uint64_t m = rte_cpu_to_le_16(1 <<
> ICE_RX_FLEX_DESC_STATUS0_DD_S);
> + /*
> + * we expect the DD bit to be set to 1 if this descriptor was already
> + * written to.
> + */
> + return (value & m) == m ? -1 : 0;
> +}
> +
> int
> ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
> {
> @@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct
> rte_power_monitor_cond *pmc)
> /* watch for changes in status bit */
> pmc->addr = &rxdp->wb.status_error0;
>
> - /*
> - * we expect the DD bit to be set to 1 if this descriptor was already
> - * written to.
> - */
> - pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
> - pmc->mask = rte_cpu_to_le_16(1 <<
> ICE_RX_FLEX_DESC_STATUS0_DD_S);
> + /* comparison callback */
> + pmc->fn = ice_monitor_callback;
>
> /* register is 16-bit */
> pmc->size = sizeof(uint16_t);
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index d69f36e977..c814a28cb4 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1369,6 +1369,18 @@ const uint32_t
> RTE_PTYPE_INNER_L3_IPV4_EXT |
> RTE_PTYPE_INNER_L4_UDP,
> };
>
> +static int
> +ixgbe_monitor_callback(const uint64_t value,
> + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ]
> __rte_unused)
> +{
> + const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> + /*
> + * we expect the DD bit to be set to 1 if this descriptor was already
> + * written to.
> + */
> + return (value & m) == m ? -1 : 0;
> +}
> +
> int
> ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond
> *pmc)
> {
> @@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct
> rte_power_monitor_cond *pmc)
> /* watch for changes in status bit */
> pmc->addr = &rxdp->wb.upper.status_error;
>
> - /*
> - * we expect the DD bit to be set to 1 if this descriptor was already
> - * written to.
> - */
> - pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> - pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
> + /* comparison callback */
> + pmc->fn = ixgbe_monitor_callback;
>
> /* the registers are 32-bit */
> pmc->size = sizeof(uint32_t);
> diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
> index 777a1d6e45..17370b77dc 100644
> --- a/drivers/net/mlx5/mlx5_rx.c
> +++ b/drivers/net/mlx5/mlx5_rx.c
> @@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
> return rx_queue_count(rxq);
> }
>
> +#define CLB_VAL_IDX 0
> +#define CLB_MSK_IDX 1
> +static int
> +mlx_monitor_callback(const uint64_t value,
> + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
> +{
> + const uint64_t m = opaque[CLB_MSK_IDX];
> + const uint64_t v = opaque[CLB_VAL_IDX];
> +
> + return (value & m) == v ? -1 : 0;
> +}
> +
> int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
> {
> struct mlx5_rxq_data *rxq = rx_queue;
> @@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
> return -rte_errno;
> }
> pmc->addr = &cqe->op_own;
> - pmc->val = !!idx;
> - pmc->mask = MLX5_CQE_OWNER_MASK;
> + pmc->opaque[CLB_VAL_IDX] = !!idx;
> + pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
> + pmc->fn = mlx_monitor_callback;
> pmc->size = sizeof(uint8_t);
> return 0;
> }
> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> index dddca3d41c..c9aa52a86d 100644
> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> @@ -18,19 +18,38 @@
> * which are architecture-dependent.
> */
>
> +/** Size of the opaque data in monitor condition */
> +#define RTE_POWER_MONITOR_OPAQUE_SZ 4
> +
> +/**
> + * Callback definition for monitoring conditions. Callbacks with this signature
> + * will be used by `rte_power_monitor()` to check if the entering of power
> + * optimized state should be aborted.
> + *
> + * @param val
> + * The value read from memory.
> + * @param opaque
> + * Callback-specific data.
> + *
> + * @return
> + * 0 if entering of power optimized state should proceed
> + * -1 if entering of power optimized state should be aborted
> + */
> +typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
> + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
> struct rte_power_monitor_cond {
> volatile void *addr; /**< Address to monitor for changes */
> - uint64_t val; /**< If the `mask` is non-zero, location pointed
> - * to by `addr` will be read and compared
> - * against this value.
> - */
> - uint64_t mask; /**< 64-bit mask to extract value read from `addr` */
> - uint8_t size; /**< Data size (in bytes) that will be used to compare
> - * expected value (`val`) with data read from the
> + uint8_t size; /**< Data size (in bytes) that will be read from the
> * monitored memory location (`addr`). Can be 1, 2,
> * 4, or 8. Supplying any other value will result in
> * an error.
> */
> + rte_power_monitor_clb_t fn; /**< Callback to be used to check if
> + * entering power optimized state should
> + * be aborted.
> + */
> + uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
> + /**< Callback-specific data */
> };
>
> /**
> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> index 39ea9fdecd..66fea28897 100644
> --- a/lib/eal/x86/rte_power_intrinsics.c
> +++ b/lib/eal/x86/rte_power_intrinsics.c
> @@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> const unsigned int lcore_id = rte_lcore_id();
> struct power_wait_status *s;
> + uint64_t cur_value;
>
> /* prevent user from running this instruction if it's not supported */
> if (!wait_supported)
> @@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> if (__check_val_size(pmc->size) < 0)
> return -EINVAL;
>
> + if (pmc->fn == NULL)
> + return -EINVAL;
> +
> s = &wait_status[lcore_id];
>
> /* update sleep address */
> @@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> /* now that we've put this address into monitor, we can unlock */
> rte_spinlock_unlock(&s->lock);
>
> - /* if we have a comparison mask, we might not need to sleep at all */
> - if (pmc->mask) {
> - const uint64_t cur_value = __get_umwait_val(
> - pmc->addr, pmc->size);
> - const uint64_t masked = cur_value & pmc->mask;
> + cur_value = __get_umwait_val(pmc->addr, pmc->size);
>
> - /* if the masked value is already matching, abort */
> - if (masked == pmc->val)
> - goto end;
> - }
> + /* check if callback indicates we should abort */
> + if (pmc->fn(cur_value, pmc->opaque) != 0)
> + goto end;
>
> /* execute UMWAIT */
> asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
> --
> 2.25.1
DLB changes look good to me
Acked-by: timothy.mcdaniel@intel.com
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
2021-07-08 16:56 ` McDaniel, Timothy
@ 2021-07-09 13:46 ` Thomas Monjalon
2021-07-09 14:41 ` Burakov, Anatoly
1 sibling, 1 reply; 165+ messages in thread
From: Thomas Monjalon @ 2021-07-09 13:46 UTC (permalink / raw)
To: Anatoly Burakov
Cc: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev,
ciara.loftus, david.hunt, david.marchand
08/07/2021 16:13, Anatoly Burakov:
> doc/guides/rel_notes/release_21_08.rst | 2 ++
> drivers/event/dlb2/dlb2.c | 17 ++++++++--
> drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
> drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
> drivers/net/ice/ice_rxtx.c | 20 +++++++----
> drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
> drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
> .../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
> lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
> 9 files changed, 122 insertions(+), 44 deletions(-)
About the title: it introduces a new prefix, "power_intrinsics:",
which is not very descriptive.
It would probably be better to use the "eal:" prefix.
> --- a/drivers/net/mlx5/mlx5_rx.c
> +++ b/drivers/net/mlx5/mlx5_rx.c
> @@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
> return rx_queue_count(rxq);
> }
>
> +#define CLB_VAL_IDX 0
> +#define CLB_MSK_IDX 1
> +static int
> +mlx_monitor_callback(const uint64_t value,
> + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
Everything is prefixed with mlx5, let's be consistent.
Please replace mlx_ with mlx5_
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison
2021-07-09 13:46 ` Thomas Monjalon
@ 2021-07-09 14:41 ` Burakov, Anatoly
0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-09 14:41 UTC (permalink / raw)
To: Thomas Monjalon
Cc: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev,
ciara.loftus, david.hunt, david.marchand
On 09-Jul-21 2:46 PM, Thomas Monjalon wrote:
> 08/07/2021 16:13, Anatoly Burakov:
>> doc/guides/rel_notes/release_21_08.rst | 2 ++
>> drivers/event/dlb2/dlb2.c | 17 ++++++++--
>> drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
>> drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
>> drivers/net/ice/ice_rxtx.c | 20 +++++++----
>> drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
>> drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
>> .../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
>> lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
>> 9 files changed, 122 insertions(+), 44 deletions(-)
>
> About the title, it is introducing a new prefix "power_intrinsics:"
> with is not so much descriptive.
> Probably better to formulate with "eal:" prefix.
>
>
>> --- a/drivers/net/mlx5/mlx5_rx.c
>> +++ b/drivers/net/mlx5/mlx5_rx.c
>> @@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
>> return rx_queue_count(rxq);
>> }
>>
>> +#define CLB_VAL_IDX 0
>> +#define CLB_MSK_IDX 1
>> +static int
>> +mlx_monitor_callback(const uint64_t value,
>> + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
>
> Everything is prefixed with mlx5, let's be consistent.
> Please replace mlx_ with mlx5_
>
Sure, will fix.
--
Thanks,
Anatoly
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v8 2/7] net/af_xdp: add power monitor support
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-07-08 14:13 ` Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 3/7] eal: add power monitor for multiple events Anatoly Burakov
` (5 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-08 14:13 UTC (permalink / raw)
To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev
Implement support for .get_monitor_addr in AF_XDP driver.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v8:
- Fix checkpatch issue
v2:
- Rewrite using the callback mechanism
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..989051dd6d 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
#include <rte_malloc.h>
#include <rte_ring.h>
#include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
#include "compat.h"
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t v = opaque[CLB_VAL_IDX];
+ const uint64_t m = (uint32_t)~0;
+
+ /* if the value has changed, abort entering power optimized state */
+ return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+ struct pkt_rx_queue *rxq = rx_queue;
+ unsigned int *prod = rxq->rx.producer;
+ const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+ /* watch for changes in producer ring */
+ pmc->addr = (void *)prod;
+
+ /* store current value */
+ pmc->opaque[CLB_VAL_IDX] = cur_val;
+ pmc->fn = eth_monitor_callback;
+
+ /* AF_XDP producer ring index is 32-bit */
+ pmc->size = sizeof(uint32_t);
+
+ return 0;
+}
+
static int
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
{
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.stats_get = eth_stats_get,
.stats_reset = eth_stats_reset,
+ .get_monitor_addr = eth_get_monitor_addr
};
/** parse busy_budget argument */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v8 3/7] eal: add power monitor for multiple events
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-07-08 14:13 ` Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
` (4 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-08 14:13 UTC (permalink / raw)
To: dev, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
Cc: ciara.loftus, david.hunt
Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v4:
- Fixed bugs in accessing the monitor condition
- Abort on any monitor condition not having a defined callback
v2:
- Adapt to callback mechanism
lib/eal/arm/rte_power_intrinsics.c | 11 +++
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 35 +++++++++
lib/eal/ppc/rte_power_intrinsics.c | 11 +++
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++
7 files changed, 137 insertions(+)
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
/**< indicates support for rte_power_monitor function */
uint32_t power_pause : 1;
/**< indicates support for rte_power_pause function */
+ uint32_t power_monitor_multi : 1;
+ /**< indicates support for rte_power_monitor_multi function */
};
/**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
__rte_experimental
int rte_power_pause(const uint64_t tsc_timestamp);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
> + * Additionally, each condition provides a callback (`fn`) and
> + * callback-specific data (`opaque`). The callback is invoked with the
> + * current value read from the monitored address, and if it indicates that
> + * a wakeup condition was met, the entering of optimized power state may be
> + * aborted.
+ *
> + * @warning It is the responsibility of the user to check if this function is
+ * supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ * Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ * An array of monitoring condition structures.
+ * @param num
+ * Length of the `pmc` array.
+ * @param tsc_timestamp
+ * Maximum TSC timestamp to wait for. Note that the wait behavior is
+ * architecture-dependent.
+ *
+ * @return
+ * 0 on success
+ * -EINVAL on invalid parameters
+ * -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp);
+
#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 2df65c6903..887012d02a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
rte_version_release; # WINDOWS_NO_EXPORT
rte_version_suffix; # WINDOWS_NO_EXPORT
rte_version_year; # WINDOWS_NO_EXPORT
+
+ # added in 21.08
+ rte_power_monitor_multi; # WINDOWS_NO_EXPORT
};
INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
intrinsics->power_monitor = 1;
intrinsics->power_pause = 1;
+ if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+ intrinsics->power_monitor_multi = 1;
}
}
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
#include <rte_common.h>
#include <rte_lcore.h>
+#include <rte_rtm.h>
#include <rte_spinlock.h>
#include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
}
static bool wait_supported;
+static bool wait_multi_supported;
static inline uint64_t
__get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
if (i.power_monitor && i.power_pause)
wait_supported = 1;
+ if (i.power_monitor_multi)
+ wait_multi_supported = 1;
}
int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
* In this case, since we've already woken up, the "wakeup" was
* unneeded, and since T1 is still waiting on T2 releasing the lock, the
* wakeup address is still valid so it's perfectly safe to write it.
+ *
> + * For the multi-monitor case, the act of locking will in itself trigger the
> + * wakeup, so no additional writes are necessary.
*/
rte_spinlock_lock(&s->lock);
if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return 0;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ const unsigned int lcore_id = rte_lcore_id();
+ struct power_wait_status *s = &wait_status[lcore_id];
+ uint32_t i, rc;
+
+ /* check if supported */
+ if (!wait_multi_supported)
+ return -ENOTSUP;
+
+ if (pmc == NULL || num == 0)
+ return -EINVAL;
+
+ /* we are already inside transaction region, return */
+ if (rte_xtest() != 0)
+ return 0;
+
+ /* start new transaction region */
+ rc = rte_xbegin();
+
+ /* transaction abort, possible write to one of wait addresses */
+ if (rc != RTE_XBEGIN_STARTED)
+ return 0;
+
+ /*
+ * the mere act of reading the lock status here adds the lock to
+ * the read set. This means that when we trigger a wakeup from another
+ * thread, even if we don't have a defined wakeup address and thus don't
+ * actually cause any writes, the act of locking our lock will itself
+ * trigger the wakeup and abort the transaction.
+ */
+ rte_spinlock_is_locked(&s->lock);
+
+ /*
+ * add all addresses to wait on into transaction read-set and check if
+ * any of wakeup conditions are already met.
+ */
+ rc = 0;
+ for (i = 0; i < num; i++) {
+ const struct rte_power_monitor_cond *c = &pmc[i];
+
+ /* cannot be NULL */
+ if (c->fn == NULL) {
+ rc = -EINVAL;
+ break;
+ }
+
+ const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+ /* abort if callback indicates that we need to stop */
+ if (c->fn(val, c->opaque) != 0)
+ break;
+ }
+
+ /* none of the conditions were met, sleep until timeout */
+ if (i == num)
+ rte_power_pause(tsc_timestamp);
+
+ /* end transaction region */
+ rte_xend();
+
+ return rc;
+}
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v8 4/7] power: remove thread safety from PMD power API's
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
` (2 preceding siblings ...)
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-07-08 14:13 ` Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
` (3 subsequent siblings)
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-08 14:13 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.
We could have used something like an RCU for this use case, but absent
a pressing need for thread safety we'll go the easy way and just
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v2:
- Add check for stopped queue
- Clarified doc message
- Added release notes
doc/guides/rel_notes/release_21_08.rst | 4 +
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 133 ++++++++++---------------
lib/power/rte_power_pmd_mgmt.h | 6 ++
4 files changed, 66 insertions(+), 80 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 65910de348..33e66d746b 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -137,6 +137,10 @@ API Changes
* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+* rte_power: The experimental PMD power management API is no longer considered
+ to be thread safe; all Rx queues affected by the API will now need to be
+ stopped before making any changes to the power management scheme.
+
ABI Changes
-----------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
'rte_power_pmd_mgmt.h',
'rte_power_guest_channel.h',
)
+if cc.has_argument('-Wno-cast-qual')
+ cflags += '-Wno-cast-qual'
+endif
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
/**< Callback mode for this queue */
const struct rte_eth_rxtx_callback *cur_cb;
/**< Callback instance */
- volatile bool umwait_in_progress;
- /**< are we currently sleeping? */
uint64_t empty_poll_stats;
/**< Number of empty polls */
} __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
struct rte_power_monitor_cond pmc;
uint16_t ret;
- /*
- * we might get a cancellation request while being
- * inside the callback, in which case the wakeup
- * wouldn't work because it would've arrived too early.
- *
- * to get around this, we notify the other thread that
- * we're sleeping, so that it can spin until we're done.
- * unsolicited wakeups are perfectly safe.
- */
- q_conf->umwait_in_progress = true;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- /* check if we need to cancel sleep */
- if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
- /* use monitoring condition to sleep */
- ret = rte_eth_get_monitor_addr(port_id, qidx,
- &pmc);
- if (ret == 0)
- rte_power_monitor(&pmc, UINT64_MAX);
- }
- q_conf->umwait_in_progress = false;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+ /* use monitoring condition to sleep */
+ ret = rte_eth_get_monitor_addr(port_id, qidx,
+ &pmc);
+ if (ret == 0)
+ rte_power_monitor(&pmc, UINT64_MAX);
}
} else
q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
return nb_rx;
}
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+ struct rte_eth_rxq_info qinfo;
+
+ if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+ return -1;
+
+ return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
struct pmd_queue_cfg *queue_cfg;
struct rte_eth_dev_info info;
+ rte_rx_callback_fn clb;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
queue_cfg = &port_cfg[port_id][queue_id];
if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->umwait_in_progress = false;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* ensure we update our state before callback starts */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_umwait, NULL);
+ clb = clb_umwait;
break;
}
case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
- queue_id, clb_scale_freq, NULL);
+ clb = clb_scale_freq;
break;
}
case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (global_data.tsc_per_us == 0)
calc_tsc();
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_pause, NULL);
+ clb = clb_pause;
break;
+ default:
+ RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+ ret = -EINVAL;
+ goto end;
}
+
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, NULL);
+
ret = 0;
end:
return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
struct pmd_queue_cfg *queue_cfg;
+ int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
return -EINVAL;
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
/* no need to check queue id as wrong queue id would not be enabled */
queue_cfg = &port_cfg[port_id][queue_id];
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
/* stop any callbacks from progressing */
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
- /* ensure we update our state before continuing */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
switch (queue_cfg->cb_mode) {
- case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- bool exit = false;
- do {
- /*
- * we may request cancellation while the other thread
- * has just entered the callback but hasn't started
- * sleeping yet, so keep waking it up until we know it's
- * done sleeping.
- */
- if (queue_cfg->umwait_in_progress)
- rte_power_monitor_wakeup(lcore_id);
- else
- exit = true;
- } while (!exit);
- }
- /* fall-through */
+ case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
rte_eth_remove_rx_callback(port_id, queue_id,
queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
break;
}
/*
- * we don't free the RX callback here because it is unsafe to do so
- * unless we know for a fact that all data plane threads have stopped.
+ * the API doc mandates that the user stops all processing on affected
+ * ports before calling any of these API's, so we can assume that the
+ * callbacks can be freed. we're intentionally casting away const-ness.
*/
- queue_cfg->cur_cb = NULL;
+ rte_free((void *)queue_cfg->cur_cb);
return 0;
}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue will be polled from.
* @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue is polled from.
* @param port_id
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v8 5/7] power: support callbacks for multiple Rx queues
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
` (3 preceding siblings ...)
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-07-08 14:13 ` Anatoly Burakov
2021-07-09 14:24 ` David Marchand
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 6/7] power: support monitoring " Anatoly Burakov
` (2 subsequent siblings)
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-08 14:13 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.
The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:
- Replace per-queue structures with per-lcore ones, so that any device
polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
added to the list of queues to poll, so that the callback is aware of
other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
shared between all queues polled on a particular lcore, and are only
activated when all queues in the list were polled and were determined
to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
is incapable of monitoring more than one address.
Also, while we're at it, update and improve the docs.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v8:
- Added a comment explaining that we want to sleep on each empty poll after
the threshold has been reached
v7:
- Fix bug where initial sleep target was always set to zero
- Fix logic in handling of n_queues_ready_to_sleep counter
- Update documentation on hardware requirements
v6:
- Track each individual queue sleep status (Konstantin)
- Fix segfault (Dave)
v5:
- Remove the "power save queue" API and replace it with the mechanism suggested
by Konstantin
v3:
- Move the list of supported NICs to NIC feature table
v2:
- Use a TAILQ for queues instead of a static array
- Address feedback from Konstantin
- Add additional checks for stopped queues
doc/guides/nics/features.rst | 10 +
doc/guides/prog_guide/power_man.rst | 69 ++--
doc/guides/rel_notes/release_21_08.rst | 3 +
lib/power/rte_power_pmd_mgmt.c | 460 +++++++++++++++++++------
4 files changed, 406 insertions(+), 136 deletions(-)
diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
* **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
* **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
.. _nic_features_other:
Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..0e66878892 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,45 @@ Ethernet PMD Power Management API
Abstract
~~~~~~~~
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
- This power saving scheme will put the CPU into optimized power state
- and use the ``rte_power_monitor()`` function
- to monitor the Ethernet PMD RX descriptor address,
- and wake the CPU up whenever there's new traffic.
-
-Pause
- This power saving scheme will avoid busy polling
- by either entering power-optimized sleep state
- with ``rte_power_pause()`` function,
- or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
- This power saving scheme will use ``librte_power`` library
- functionality to scale the core frequency up/down
- depending on traffic volume.
-
-.. note::
-
- Currently, this power management API is limited to mandatory mapping
- of 1 queue to 1 core (multiple queues are supported,
- but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides
+a convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever the empty poll count reaches a certain number.
+
+* Monitor
+ This power saving scheme will put the CPU into optimized power state and
+ monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+ there's new traffic. Support for this scheme may not be available on all
+ platforms, and further limitations may apply (see below).
+
+* Pause
+ This power saving scheme will avoid busy polling by either entering
+ power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+ not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+ This power saving scheme will use ``librte_power`` library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* On Linux* x86_64, `rte_power_monitor()` requires WAITPKG instruction set being
+ supported by the CPU. Please refer to your platform documentation for further
+ information.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+ limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
+ monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+ ``rte_power_monitor()`` function is not supported, then monitor mode will not
+ be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+ platform may support the necessary CPU instructions. Please refer to
+ :doc:`../nics/overview` for more information.
+
API Overview for Ethernet PMD Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +253,5 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 33e66d746b..181a767b5e 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -103,6 +103,9 @@ New Features
usecases. Configuration happens via standard rawdev enq/deq operations. See
the :doc:`../rawdevs/cnxk_bphy` rawdev guide for more details on this driver.
+* rte_power: The experimental PMD power management API now supports managing
+ multiple Ethernet Rx queues per lcore.
+
Removed Items
-------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..30772791af 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,98 @@ enum pmd_mgmt_state {
PMD_MGMT_ENABLED
};
-struct pmd_queue_cfg {
+union queue {
+ uint32_t val;
+ struct {
+ uint16_t portid;
+ uint16_t qid;
+ };
+};
+
+struct queue_list_entry {
+ TAILQ_ENTRY(queue_list_entry) next;
+ union queue queue;
+ uint64_t n_empty_polls;
+ uint64_t n_sleeps;
+ const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+ TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+ /**< List of queues associated with this lcore */
+ size_t n_queues;
+ /**< How many queues are in the list? */
volatile enum pmd_mgmt_state pwr_mgmt_state;
/**< State of power management for this queue */
enum rte_power_pmd_mgmt_type cb_mode;
/**< Callback mode for this queue */
- const struct rte_eth_rxtx_callback *cur_cb;
- /**< Callback instance */
- uint64_t empty_poll_stats;
- /**< Number of empty polls */
+ uint64_t n_queues_ready_to_sleep;
+ /**< Number of queues ready to enter power optimized state */
+ uint64_t sleep_target;
+ /**< Prevent a queue from triggering sleep multiple times */
} __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+ return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+ dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *cur;
+
+ TAILQ_FOREACH(cur, &cfg->head, next) {
+ if (queue_equal(&cur->queue, q))
+ return cur;
+ }
+ return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *qle;
+
+ /* is it already in the list? */
+ if (queue_list_find(cfg, q) != NULL)
+ return -EEXIST;
+
+ qle = malloc(sizeof(*qle));
+ if (qle == NULL)
+ return -ENOMEM;
+ memset(qle, 0, sizeof(*qle));
+
+ queue_copy(&qle->queue, q);
+ TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+ cfg->n_queues++;
+
+ return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *found;
+
+ found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return NULL;
+
+ TAILQ_REMOVE(&cfg->head, found, next);
+ cfg->n_queues--;
+
+ /* freeing is responsibility of the caller */
+ return found;
+}
static void
calc_tsc(void)
@@ -74,21 +154,79 @@ calc_tsc(void)
}
}
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ const bool is_ready_to_sleep = qcfg->n_sleeps == cfg->sleep_target;
+
+ /* reset empty poll counter for this queue */
+ qcfg->n_empty_polls = 0;
+ /* reset the queue sleep counter as well */
+ qcfg->n_sleeps = 0;
+ /* remove the queue from list of queues ready to sleep */
+ if (is_ready_to_sleep)
+ cfg->n_queues_ready_to_sleep--;
+ /*
+ * no need to change the lcore sleep target counter because this lcore will
+ * reach the n_sleeps anyway, and the other cores are already counted so
+ * there's no need to do anything else.
+ */
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ /* this function is called - that means we have an empty poll */
+ qcfg->n_empty_polls++;
+
+ /* if we haven't reached threshold for empty polls, we can't sleep */
+ if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+ return false;
+
+ /*
+ * we've reached a point where we are able to sleep, but we still need
+ * to check if this queue has already been marked for sleeping.
+ */
+ if (qcfg->n_sleeps == cfg->sleep_target)
+ return true;
+
+ /* mark this queue as ready for sleep */
+ qcfg->n_sleeps = cfg->sleep_target;
+ cfg->n_queues_ready_to_sleep++;
+
+ return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+ /* are all queues ready to sleep? */
+ if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+ return false;
+
+ /* we've reached an iteration where we can sleep, reset sleep counter */
+ cfg->n_queues_ready_to_sleep = 0;
+ cfg->sleep_target++;
+ /*
+ * we do not reset any individual queue empty poll counters, because
+ * we want to keep sleeping on every poll until we actually get traffic.
+ */
+
+ return true;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+ uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
{
+ struct queue_list_entry *queue_conf = arg;
- struct pmd_queue_cfg *q_conf;
-
- q_conf = &port_cfg[port_id][qidx];
-
+ /* this callback can't do more than one queue, omit multiqueue logic */
if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+ queue_conf->n_empty_polls++;
+ if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
struct rte_power_monitor_cond pmc;
- uint16_t ret;
+ int ret;
/* use monitoring condition to sleep */
ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +235,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
rte_power_monitor(&pmc, UINT64_MAX);
}
} else
- q_conf->empty_poll_stats = 0;
+ queue_conf->n_empty_polls = 0;
return nb_rx;
}
static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
- q_conf = &port_cfg[port_id][qidx];
+ lcore_conf = &lcore_cfgs[lcore];
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- /* sleep for 1 microsecond */
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
- /* use tpause if we have it */
- if (global_data.intrinsics_support.power_pause) {
- const uint64_t cur = rte_rdtsc();
- const uint64_t wait_tsc =
- cur + global_data.tsc_per_us;
- rte_power_pause(wait_tsc);
- } else {
- uint64_t i;
- for (i = 0; i < global_data.pause_per_us; i++)
- rte_pause();
- }
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* sleep for 1 microsecond, use tpause if we have it */
+ if (global_data.intrinsics_support.power_pause) {
+ const uint64_t cur = rte_rdtsc();
+ const uint64_t wait_tsc =
+ cur + global_data.tsc_per_us;
+ rte_power_pause(wait_tsc);
+ } else {
+ uint64_t i;
+ for (i = 0; i < global_data.pause_per_us; i++)
+ rte_pause();
}
- } else
- q_conf->empty_poll_stats = 0;
+ }
return nb_rx;
}
static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
- uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+ struct queue_list_entry *queue_conf = arg;
- q_conf = &port_cfg[port_id][qidx];
+ if (likely(!empty)) {
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
- /* scale down freq */
- rte_power_freq_min(rte_lcore_id());
- } else {
- q_conf->empty_poll_stats = 0;
- /* scale up freq */
+ /* scale up freq immediately */
rte_power_freq_max(rte_lcore_id());
+ } else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ rte_power_freq_min(rte_lcore_id());
}
return nb_rx;
@@ -167,11 +322,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
}
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+ const struct queue_list_entry *entry;
+
+ TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+ const union queue *q = &entry->queue;
+ int ret = queue_stopped(q->portid, q->qid);
+ if (ret != 1)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+ enum power_management_env env;
+
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ return -ENOTSUP;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore))
+ return -EINVAL;
+
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+ struct rte_power_monitor_cond dummy;
+
+ /* check if rte_power_monitor is supported */
+ if (!global_data.intrinsics_support.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ return -ENOTSUP;
+ }
+
+ if (cfg->n_queues > 0) {
+ RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+ return -ENOTSUP;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+ &dummy) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
struct rte_eth_dev_info info;
rte_rx_callback_fn clb;
int ret;
@@ -202,9 +426,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
+ /* if callback was already enabled, check current callback type */
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+ lcore_cfg->cb_mode != mode) {
ret = -EINVAL;
goto end;
}
@@ -214,53 +448,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
switch (mode) {
case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- struct rte_power_monitor_cond dummy;
-
- /* check if rte_power_monitor is supported */
- if (!global_data.intrinsics_support.power_monitor) {
- RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_monitor(lcore_cfg, &qdata);
+ if (ret < 0)
goto end;
- }
- /* check if the device supports the necessary PMD API */
- if (rte_eth_get_monitor_addr(port_id, queue_id,
- &dummy) == -ENOTSUP) {
- RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_umwait;
break;
- }
case RTE_POWER_MGMT_TYPE_SCALE:
- {
- enum power_management_env env;
- /* only PSTATE and ACPI modes are supported */
- if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
- !rte_power_check_env_supported(
- PM_ENV_PSTATE_CPUFREQ)) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_scale(lcore_id);
+ if (ret < 0)
goto end;
- }
- /* ensure we could initialize the power library */
- if (rte_power_init(lcore_id)) {
- ret = -EINVAL;
- goto end;
- }
- /* ensure we initialized the correct env */
- env = rte_power_get_env();
- if (env != PM_ENV_ACPI_CPUFREQ &&
- env != PM_ENV_PSTATE_CPUFREQ) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_scale_freq;
break;
- }
case RTE_POWER_MGMT_TYPE_PAUSE:
/* figure out various time-to-tsc conversions */
if (global_data.tsc_per_us == 0)
@@ -273,13 +474,27 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -EINVAL;
goto end;
}
+ /* add this queue to the list */
+ ret = queue_list_add(lcore_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+ strerror(-ret));
+ goto end;
+ }
+ /* new queue is always added last */
+ queue_cfg = TAILQ_LAST(&lcore_cfg->head, queue_list_head);
+
+ /* when enabling first queue, ensure sleep target is not 0 */
+ if (lcore_cfg->n_queues == 1 && lcore_cfg->sleep_target == 0)
+ lcore_cfg->sleep_target = 1;
/* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb, NULL);
+ if (lcore_cfg->n_queues == 1) {
+ lcore_cfg->cb_mode = mode;
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ }
+ queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, queue_cfg);
ret = 0;
end:
@@ -290,7 +505,9 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +523,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
}
/* no need to check queue id as wrong queue id would not be enabled */
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
return -EINVAL;
- /* stop any callbacks from progressing */
- queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ /*
+ * There is no good/easy way to do this without race conditions, so we
+ * are just going to throw our hands in the air and hope that the user
+ * has read the documentation and has ensured that ports are stopped at
+ * the time we enter the API functions.
+ */
+ queue_cfg = queue_list_take(lcore_cfg, &qdata);
+ if (queue_cfg == NULL)
+ return -ENOENT;
- switch (queue_cfg->cb_mode) {
+ /* if we've removed all queues from the lists, set state to disabled */
+ if (lcore_cfg->n_queues == 0)
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+ switch (lcore_cfg->cb_mode) {
case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
break;
case RTE_POWER_MGMT_TYPE_SCALE:
rte_power_freq_max(lcore_id);
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
rte_power_exit(lcore_id);
break;
}
@@ -332,7 +565,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
* ports before calling any of these API's, so we can assume that the
* callbacks can be freed. we're intentionally casting away const-ness.
*/
- rte_free((void *)queue_cfg->cur_cb);
+ rte_free((void *)queue_cfg->cb);
+ free(queue_cfg);
return 0;
}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+ size_t i;
+
+ /* initialize all tailqs */
+ for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+ struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+ TAILQ_INIT(&cfg->head);
+ }
+}
--
2.25.1
* Re: [dpdk-dev] [PATCH v8 5/7] power: support callbacks for multiple Rx queues
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-09 14:24 ` David Marchand
2021-07-09 14:42 ` Burakov, Anatoly
0 siblings, 1 reply; 165+ messages in thread
From: David Marchand @ 2021-07-09 14:24 UTC (permalink / raw)
To: Anatoly Burakov, Thomas Monjalon, Andrew Rybchenko, Yigit, Ferruh
Cc: dev, David Hunt, Ciara Loftus, Ananyev, Konstantin
On Thu, Jul 8, 2021 at 4:14 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 403c2b03a3..a96e12d155 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
> * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
> * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
>
> +.. _nic_features_get_monitor_addr:
> +
> +PMD power management using monitor addresses
> +--------------------------------------------
> +
> +Supports getting a monitoring condition to use together with Ethernet PMD power
> +management (see :doc:`../prog_guide/power_man` for more details).
> +
> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
> +
- This new ethdev feature deserves its own commit.
- Adding ethdev maintainers.
We are missing a doc/guides/nics/features/default.ini.
The name of the features proposed here is rather long.
As far as I can see, features should be shorter than:
doc/guides/conf.py:feature_str_len = 30
I am not really inspired.. "Power mgmt address monitor"?
- pmd supporting this feature must have their .ini updated.
> .. _nic_features_other:
>
> Other dev ops not represented by a Feature
--
David Marchand
* Re: [dpdk-dev] [PATCH v8 5/7] power: support callbacks for multiple Rx queues
2021-07-09 14:24 ` David Marchand
@ 2021-07-09 14:42 ` Burakov, Anatoly
2021-07-09 14:46 ` David Marchand
0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-09 14:42 UTC (permalink / raw)
To: David Marchand, Thomas Monjalon, Andrew Rybchenko, Yigit, Ferruh
Cc: dev, David Hunt, Ciara Loftus, Ananyev, Konstantin
On 09-Jul-21 3:24 PM, David Marchand wrote:
> On Thu, Jul 8, 2021 at 4:14 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
>> index 403c2b03a3..a96e12d155 100644
>> --- a/doc/guides/nics/features.rst
>> +++ b/doc/guides/nics/features.rst
>> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
>> * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
>> * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
>>
>> +.. _nic_features_get_monitor_addr:
>> +
>> +PMD power management using monitor addresses
>> +--------------------------------------------
>> +
>> +Supports getting a monitoring condition to use together with Ethernet PMD power
>> +management (see :doc:`../prog_guide/power_man` for more details).
>> +
>> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
>> +
>
> - This new ethdev feature deserves its own commit.
>
> - Adding ethdev maintainers.
>
> We are missing a doc/guides/nics/features/default.ini.
> The name of the features proposed here is rather long.
> As far as I can see, features should be shorter than:
> doc/guides/conf.py:feature_str_len = 30
> I am not really inspired.. "Power mgmt address monitor"?
>
> - pmd supporting this feature must have their .ini updated.
>
>
>> .. _nic_features_other:
>>
>> Other dev ops not represented by a Feature
>
>
>
You'd have to walk me through/translate whatever it is that you just
said as i have no idea what any of that means :D
--
Thanks,
Anatoly
* Re: [dpdk-dev] [PATCH v8 5/7] power: support callbacks for multiple Rx queues
2021-07-09 14:42 ` Burakov, Anatoly
@ 2021-07-09 14:46 ` David Marchand
2021-07-09 14:53 ` Burakov, Anatoly
0 siblings, 1 reply; 165+ messages in thread
From: David Marchand @ 2021-07-09 14:46 UTC (permalink / raw)
To: Burakov, Anatoly
Cc: Thomas Monjalon, Andrew Rybchenko, Yigit, Ferruh, dev,
David Hunt, Ciara Loftus, Ananyev, Konstantin
On Fri, Jul 9, 2021 at 4:42 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
> You'd have to walk me through/translate whatever it is that you just
> said as i have no idea what any of that means :D
https://git.dpdk.org/dpdk/commit/?id=fa5dbd825a
--
David Marchand
* Re: [dpdk-dev] [PATCH v8 5/7] power: support callbacks for multiple Rx queues
2021-07-09 14:46 ` David Marchand
@ 2021-07-09 14:53 ` Burakov, Anatoly
0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-09 14:53 UTC (permalink / raw)
To: David Marchand
Cc: Thomas Monjalon, Andrew Rybchenko, Yigit, Ferruh, dev,
David Hunt, Ciara Loftus, Ananyev, Konstantin
On 09-Jul-21 3:46 PM, David Marchand wrote:
> On Fri, Jul 9, 2021 at 4:42 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
>> You'd have to walk me through/translate whatever it is that you just
>> said as i have no idea what any of that means :D
>
> https://git.dpdk.org/dpdk/commit/?id=fa5dbd825a
>
>
Thanks, i'll separate it out into a separate commit and try to come up
with a good name for it!
--
Thanks,
Anatoly
* [dpdk-dev] [PATCH v8 6/7] power: support monitoring multiple Rx queues
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
` (4 preceding siblings ...)
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-08 14:13 ` Anatoly Burakov
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-08 14:13 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy-efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v6:
- Fix the missed feedback from v5
v4:
- Fix possible out of bounds access
- Added missing index increment
doc/guides/prog_guide/power_man.rst | 15 ++++--
lib/power/rte_power_pmd_mgmt.c | 82 ++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 7 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0e66878892..e387d7811e 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,17 +221,22 @@ power saving whenever empty poll count reaches a certain number.
The "monitor" mode is only supported in the following configurations and scenarios:
* On Linux* x86_64, `rte_power_monitor()` requires WAITPKG instruction set being
- supported by the CPU. Please refer to your platform documentation for further
- information.
+ supported by the CPU, while `rte_power_monitor_multi()` requires WAITPKG and
+ RTM instruction sets being supported by the CPU. RTM instruction set may also
+ require booting the Linux with `tsx=on` command line parameter. Please refer
+ to your platform documentation for further information.
* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor_multi()`` function is supported by the platform, then
+ monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
``rte_power_monitor()`` is supported by the platform, then monitoring will be
limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
monitored from a different lcore).
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
- ``rte_power_monitor()`` function is not supported, then monitor mode will not
- be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+ two monitoring functions are supported, then monitor mode will not be supported.
* Not all Ethernet drivers support monitoring, even if the underlying
platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 30772791af..2586204b93 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -126,6 +126,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
return found;
}
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+ struct rte_power_monitor_cond *pmc, size_t len)
+{
+ const struct queue_list_entry *qle;
+ size_t i = 0;
+ int ret;
+
+ TAILQ_FOREACH(qle, &cfg->head, next) {
+ const union queue *q = &qle->queue;
+ struct rte_power_monitor_cond *cur;
+
+ /* attempted out of bounds access */
+ if (i >= len) {
+ RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+ return -1;
+ }
+
+ cur = &pmc[i++];
+ ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
static void
calc_tsc(void)
{
@@ -215,6 +241,46 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
return true;
}
+static uint16_t
+clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
+{
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
+
+ lcore_conf = &lcore_cfgs[lcore];
+
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ struct rte_power_monitor_cond pmc[lcore_conf->n_queues];
+ int ret;
+
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* gather all monitoring conditions */
+ ret = get_monitor_addresses(lcore_conf, pmc,
+ lcore_conf->n_queues);
+ if (ret < 0)
+ return nb_rx;
+
+ rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
+ }
+
+ return nb_rx;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
@@ -366,14 +432,19 @@ static int
check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
{
struct rte_power_monitor_cond dummy;
+ bool multimonitor_supported;
/* check if rte_power_monitor is supported */
if (!global_data.intrinsics_support.power_monitor) {
RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
return -ENOTSUP;
}
+ /* check if multi-monitor is supported */
+ multimonitor_supported =
+ global_data.intrinsics_support.power_monitor_multi;
- if (cfg->n_queues > 0) {
+ /* if we're adding a new queue, do we support multiple queues? */
+ if (cfg->n_queues > 0 && !multimonitor_supported) {
RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
return -ENOTSUP;
}
@@ -389,6 +460,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
return 0;
}
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+ return global_data.intrinsics_support.power_monitor_multi ?
+ clb_multiwait : clb_umwait;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -453,7 +531,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (ret < 0)
goto end;
- clb = clb_umwait;
+ clb = get_monitor_callback();
break;
case RTE_POWER_MGMT_TYPE_SCALE:
/* check if we can add a new queue */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
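The bounds-checked gathering loop in `get_monitor_addresses()` above can be exercised in isolation with mock types. A minimal sketch, assuming simplified stand-ins — `struct queue`, `struct monitor_cond` and `get_monitor_addr()` are hypothetical mocks here, not the real DPDK API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the lib/power types used by
 * get_monitor_addresses() in the patch above (not the real DPDK API). */
struct queue { uint16_t portid, qid; };
struct monitor_cond { uintptr_t addr; };

/* mock of rte_eth_get_monitor_addr(): always succeeds here */
static int
get_monitor_addr(const struct queue *q, struct monitor_cond *cur)
{
	cur->addr = (uintptr_t)((uint32_t)q->portid << 16 | q->qid);
	return 0;
}

/* Same shape as the patch: fill one condition per queue, bail out with
 * -1 before an out-of-bounds write, propagate per-queue errors. */
static int
gather_monitor_conditions(const struct queue *queues, size_t n_queues,
		struct monitor_cond *pmc, size_t len)
{
	size_t i;

	for (i = 0; i < n_queues; i++) {
		int ret;

		if (i >= len)
			return -1; /* too many queues being monitored */

		ret = get_monitor_addr(&queues[i], &pmc[i]);
		if (ret < 0)
			return ret;
	}
	return 0;
}
```

With three queues and a two-element `pmc` array the call returns -1 rather than writing past the end; with a large enough array it returns 0, after which the real code hands the array to `rte_power_monitor_multi()`.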
* [dpdk-dev] [PATCH v8 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
` (5 preceding siblings ...)
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 6/7] power: support monitoring " Anatoly Burakov
@ 2021-07-08 14:13 ` Anatoly Burakov
2021-07-09 14:50 ` David Marchand
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-08 14:13 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
examples/l3fwd-power/main.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..52f56dc405 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2723,12 +2723,6 @@ main(int argc, char **argv)
printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
fflush(stdout);
- /* PMD power management mode can only do 1 queue per core */
- if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
- rte_exit(EXIT_FAILURE,
- "In PMD power management mode, only one queue per lcore is allowed\n");
- }
-
/* init RX queues */
for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
struct rte_eth_rxconf rxq_conf;
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 0/7] Enhancements for PMD power management Anatoly Burakov
` (6 preceding siblings ...)
2021-07-08 14:13 ` [dpdk-dev] [PATCH v8 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 1/8] eal: use callbacks for power monitoring comparison Anatoly Burakov
` (9 more replies)
7 siblings, 10 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev; +Cc: david.hunt, ciara.loftus, konstantin.ananyev
This patchset introduces several changes related to PMD power management:
- Changed monitoring intrinsics to use callbacks as a comparison function, based
on previous patchset [1] but incorporating feedback [2] - this hopefully will
make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
accompanying infrastructure and example apps changes
v9:
- Added all missing Acks and Tests
- Added a new commit with NIC features
- Addressed minor issues raised in review
v8:
- Fixed checkpatch issue
- Added comment explaining empty poll handling (Konstantin)
v7:
- Fixed various bugs
v6:
- Improved the algorithm for multi-queue sleep
- Fixed segfault and addressed other feedback
v5:
- Removed "power save queue" API and replaced with mechanism suggested by
Konstantin
- Addressed other feedback
v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections
v3:
- Moved some doc updates to NIC features list
v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary
[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274
Anatoly Burakov (8):
eal: use callbacks for power monitoring comparison
net/af_xdp: add power monitor support
doc: add PMD power management NIC feature
eal: add power monitor for multiple events
power: remove thread safety from PMD power API's
power: support callbacks for multiple Rx queues
power: support monitoring multiple Rx queues
examples/l3fwd-power: support multiq in PMD modes
doc/guides/nics/features.rst | 10 +
doc/guides/nics/features/default.ini | 1 +
doc/guides/prog_guide/power_man.rst | 74 +-
doc/guides/rel_notes/release_21_08.rst | 11 +
drivers/event/dlb2/dlb2.c | 17 +-
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +
drivers/net/i40e/i40e_rxtx.c | 20 +-
drivers/net/iavf/iavf_rxtx.c | 20 +-
drivers/net/ice/ice_rxtx.c | 20 +-
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +-
drivers/net/mlx5/mlx5_rx.c | 17 +-
examples/l3fwd-power/main.c | 6 -
lib/eal/arm/rte_power_intrinsics.c | 11 +
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 68 +-
lib/eal/ppc/rte_power_intrinsics.c | 11 +
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 90 ++-
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 663 +++++++++++++-----
lib/power/rte_power_pmd_mgmt.h | 6 +
22 files changed, 847 insertions(+), 262 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
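The run-time selection between the single- and multi-queue monitoring paths described in this cover letter can be sketched with mock capability flags. Everything below is hypothetical illustration — `struct intrinsics_support` stands in for the flags `rte_cpu_get_intrinsics_support()` reports, and the string return values stand in for the `clb_umwait`/`clb_multiwait` callbacks installed by the real code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock of the flags rte_cpu_get_intrinsics_support()
 * reports on a real system. */
struct intrinsics_support {
	bool power_monitor;       /* WAITPKG (UMONITOR/UMWAIT/TPAUSE) */
	bool power_monitor_multi; /* WAITPKG plus RTM */
};

typedef const char *rx_callback_t; /* stand-in for rte_rx_callback_fn */

/* Mirrors the check_monitor()/get_monitor_callback() logic: refuse
 * monitor mode entirely without WAITPKG, refuse multiple queues per
 * lcore without multi-monitor support, otherwise pick the callback
 * that matches the platform's capabilities. */
static rx_callback_t
pick_monitor_callback(const struct intrinsics_support *s, size_t n_queues)
{
	if (!s->power_monitor)
		return NULL; /* monitor mode not supported at all */
	if (n_queues > 1 && !s->power_monitor_multi)
		return NULL; /* 1 core : 1 queue limit applies */
	return s->power_monitor_multi ? "clb_multiwait" : "clb_umwait";
}
```

This captures the three tiers documented in power_man.rst: multi-queue monitoring, single-queue-per-lcore monitoring, or no monitor mode at all.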
* [dpdk-dev] [PATCH v9 1/8] eal: use callbacks for power monitoring comparison
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 2/8] net/af_xdp: add power monitor support Anatoly Burakov
` (8 subsequent siblings)
9 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Previously, the semantics of power monitor were such that we were
checking current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.
This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define its
own comparison semantics and decide how to detect the need to abort
entering the power optimized state.
Existing implementations are adjusted to follow the new semantics.
Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
Acked-by: Timothy McDaniel <timothy.mcdaniel@intel.com>
---
Notes:
v4:
- Return error if callback is set to NULL
- Replace raw number with a macro in monitor condition opaque data
v2:
- Use callback mechanism for more flexibility
- Address feedback from Konstantin
doc/guides/rel_notes/release_21_08.rst | 2 ++
drivers/event/dlb2/dlb2.c | 17 ++++++++--
drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
drivers/net/ice/ice_rxtx.c | 20 +++++++----
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
.../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
9 files changed, 122 insertions(+), 44 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 476822b47f..912fb13b84 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -144,6 +144,8 @@ API Changes
* eal: ``rte_strscpy`` sets ``rte_errno`` to ``E2BIG`` in case of string
truncation.
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+
ABI Changes
-----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
}
}
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ /* abort if the value matches */
+ return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
static inline int
dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
expected_value = 0;
pmc.addr = monitor_addr;
- pmc.val = expected_value;
- pmc.mask = qe_mask.raw_qe[1];
+ /* store expected value and comparison mask in opaque data */
+ pmc.opaque[CLB_VAL_IDX] = expected_value;
+ pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+ /* set up callback */
+ pmc.fn = dlb2_monitor_callback;
pmc.size = sizeof(uint64_t);
rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index e518409fe5..8489f91f1d 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
#define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
+static int
+i40e_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = i40e_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index f817fbc49b..d61b32fcee 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
}
+static int
+iavf_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = iavf_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 3f6e735984..5d7ab4f047 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
+static int
+ice_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.status_error0;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
- pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /* comparison callback */
+ pmc->fn = ice_monitor_callback;
/* register is 16-bit */
pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
};
+static int
+ixgbe_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.upper.status_error;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
- pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /* comparison callback */
+ pmc->fn = ixgbe_monitor_callback;
/* the registers are 32-bit */
pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..8d47637892 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return rx_queue_count(rxq);
}
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx5_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t m = opaque[CLB_MSK_IDX];
+ const uint64_t v = opaque[CLB_VAL_IDX];
+
+ return (value & m) == v ? -1 : 0;
+}
+
int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -rte_errno;
}
pmc->addr = &cqe->op_own;
- pmc->val = !!idx;
- pmc->mask = MLX5_CQE_OWNER_MASK;
+ pmc->opaque[CLB_VAL_IDX] = !!idx;
+ pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+ pmc->fn = mlx5_monitor_callback;
pmc->size = sizeof(uint8_t);
return 0;
}
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
* which are architecture-dependent.
*/
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ * The value read from memory.
+ * @param opaque
+ * Callback-specific data.
+ *
+ * @return
+ * 0 if entering of power optimized state should proceed
+ * -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
struct rte_power_monitor_cond {
volatile void *addr; /**< Address to monitor for changes */
- uint64_t val; /**< If the `mask` is non-zero, location pointed
- * to by `addr` will be read and compared
- * against this value.
- */
- uint64_t mask; /**< 64-bit mask to extract value read from `addr` */
- uint8_t size; /**< Data size (in bytes) that will be used to compare
- * expected value (`val`) with data read from the
+ uint8_t size; /**< Data size (in bytes) that will be read from the
* monitored memory location (`addr`). Can be 1, 2,
* 4, or 8. Supplying any other value will result in
* an error.
*/
+ rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+ * entering power optimized state should
+ * be aborted.
+ */
+ uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+ /**< Callback-specific data */
};
/**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
const unsigned int lcore_id = rte_lcore_id();
struct power_wait_status *s;
+ uint64_t cur_value;
/* prevent user from running this instruction if it's not supported */
if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
if (__check_val_size(pmc->size) < 0)
return -EINVAL;
+ if (pmc->fn == NULL)
+ return -EINVAL;
+
s = &wait_status[lcore_id];
/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
/* now that we've put this address into monitor, we can unlock */
rte_spinlock_unlock(&s->lock);
- /* if we have a comparison mask, we might not need to sleep at all */
- if (pmc->mask) {
- const uint64_t cur_value = __get_umwait_val(
- pmc->addr, pmc->size);
- const uint64_t masked = cur_value & pmc->mask;
+ cur_value = __get_umwait_val(pmc->addr, pmc->size);
- /* if the masked value is already matching, abort */
- if (masked == pmc->val)
- goto end;
- }
+ /* check if callback indicates we should abort */
+ if (pmc->fn(cur_value, pmc->opaque) != 0)
+ goto end;
/* execute UMWAIT */
asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
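The new condition can be exercised outside of a driver with a small mock. A minimal sketch: the struct and typedef mirror the patched `rte_power_monitor_cond`, while `dd_monitor_callback()`, `setup_monitor()` and the DD bit position are hypothetical stand-ins in the style of the mlx5/i40e callbacks above:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-ins for the EAL definitions added by this patch; the
 * real ones live in rte_power_intrinsics.h. */
#define RTE_POWER_MONITOR_OPAQUE_SZ 4

typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);

struct rte_power_monitor_cond {
	volatile void *addr;
	uint8_t size;
	rte_power_monitor_clb_t fn;
	uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
};

/* Hypothetical driver callback: the opaque slots carry the mask and
 * expected value, and -1 aborts the sleep once the masked value
 * matches (i.e. the descriptor was written to). */
#define CLB_MSK_IDX 0
#define CLB_VAL_IDX 1

static int
dd_monitor_callback(const uint64_t value,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
{
	return (value & opaque[CLB_MSK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
}

static void
setup_monitor(volatile uint64_t *status, struct rte_power_monitor_cond *pmc)
{
	const uint64_t dd_bit = 1ULL << 1; /* made-up DD bit position */

	pmc->addr = status;
	pmc->size = sizeof(uint64_t);
	pmc->opaque[CLB_MSK_IDX] = dd_bit;
	pmc->opaque[CLB_VAL_IDX] = dd_bit;
	pmc->fn = dd_monitor_callback;
}
```

`rte_power_monitor(&pmc, tsc_timestamp)` would then read `*addr`, invoke `fn`, and only enter UMWAIT when the callback returns 0 — matching the x86 implementation at the end of the diff.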
* [dpdk-dev] [PATCH v9 1/8] eal: use callbacks for power monitoring comparison
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 1/8] eal: use callbacks for power monitoring comparison Anatoly Burakov
@ 2021-07-09 16:00 ` Anatoly Burakov
0 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:00 UTC (permalink / raw)
To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
- if (masked == pmc->val)
- goto end;
- }
+ /* check if callback indicates we should abort */
+ if (pmc->fn(cur_value, pmc->opaque) != 0)
+ goto end;
/* execute UMWAIT */
asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v9 2/8] net/af_xdp: add power monitor support
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 1/8] eal: use callbacks for power monitoring comparison Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 3/8] doc: add PMD power management NIC feature Anatoly Burakov
` (7 subsequent siblings)
9 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev
Implement support for .get_monitor_addr in AF_XDP driver.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v8:
- Fix checkpatch issue
v2:
- Rewrite using the callback mechanism
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..989051dd6d 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
#include <rte_malloc.h>
#include <rte_ring.h>
#include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
#include "compat.h"
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t v = opaque[CLB_VAL_IDX];
+ const uint64_t m = (uint32_t)~0;
+
+ /* if the value has changed, abort entering power optimized state */
+ return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+ struct pkt_rx_queue *rxq = rx_queue;
+ unsigned int *prod = rxq->rx.producer;
+ const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+ /* watch for changes in producer ring */
+ pmc->addr = (void *)prod;
+
+ /* store current value */
+ pmc->opaque[CLB_VAL_IDX] = cur_val;
+ pmc->fn = eth_monitor_callback;
+
+ /* AF_XDP producer ring index is 32-bit */
+ pmc->size = sizeof(uint32_t);
+
+ return 0;
+}
+
static int
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
{
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.stats_get = eth_stats_get,
.stats_reset = eth_stats_reset,
+ .get_monitor_addr = eth_get_monitor_addr
};
/** parse busy_budget argument */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v9 3/8] doc: add PMD power management NIC feature
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 1/8] eal: use callbacks for power monitoring comparison Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 2/8] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 15:57 ` Burakov, Anatoly
2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 4/8] eal: add power monitor for multiple events Anatoly Burakov
` (6 subsequent siblings)
9 siblings, 2 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev, Ferruh Yigit
Cc: david.hunt, ciara.loftus, konstantin.ananyev, David Marchand
At this point, multiple different Ethernet drivers from multiple vendors
will support the PMD power management scheme. It would be useful to add
it to the NIC feature table to indicate support for it.
Suggested-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
doc/guides/nics/features.rst | 10 ++++++++++
doc/guides/nics/features/default.ini | 1 +
2 files changed, 11 insertions(+)
diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
* **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
* **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
.. _nic_features_other:
Other dev ops not represented by a Feature
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 3b55e0ccb0..f1e947bd9e 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -76,6 +76,7 @@ x86-64 =
Usage doc =
Design doc =
Perf doc =
+Power mgmt address monitor =
[rte_flow items]
ah =
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* Re: [dpdk-dev] [PATCH v9 3/8] doc: add PMD power management NIC feature
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 3/8] doc: add PMD power management NIC feature Anatoly Burakov
@ 2021-07-09 15:57 ` Burakov, Anatoly
2021-07-09 16:00 ` Anatoly Burakov
1 sibling, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-09 15:57 UTC (permalink / raw)
To: dev, Ferruh Yigit
Cc: david.hunt, ciara.loftus, konstantin.ananyev, David Marchand
On 09-Jul-21 4:53 PM, Anatoly Burakov wrote:
> At this point, multiple different Ethernet drivers from multiple vendors
> will support the PMD power management scheme. It would be useful to add
> it to the NIC feature table to indicate support for it.
>
> Suggested-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> doc/guides/nics/features.rst | 10 ++++++++++
> doc/guides/nics/features/default.ini | 1 +
> 2 files changed, 11 insertions(+)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 403c2b03a3..a96e12d155 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
> * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
> * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
>
> +.. _nic_features_get_monitor_addr:
> +
> +PMD power management using monitor addresses
> +--------------------------------------------
> +
> +Supports getting a monitoring condition to use together with Ethernet PMD power
> +management (see :doc:`../prog_guide/power_man` for more details).
> +
> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
> +
> .. _nic_features_other:
>
> Other dev ops not represented by a Feature
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index 3b55e0ccb0..f1e947bd9e 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -76,6 +76,7 @@ x86-64 =
> Usage doc =
> Design doc =
> Perf doc =
> +Power mgmt address monitor =
>
> [rte_flow items]
> ah =
>
Apologies, forgot to git add the driver files to the commit. Will respin
shortly.
--
Thanks,
Anatoly
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v9 4/8] eal: add power monitor for multiple events
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
` (2 preceding siblings ...)
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 3/8] doc: add PMD power management NIC feature Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 5/8] power: remove thread safety from PMD power API's Anatoly Burakov
` (5 subsequent siblings)
9 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v4:
- Fixed bugs in accessing the monitor condition
- Abort on any monitor condition not having a defined callback
v2:
- Adapt to callback mechanism
lib/eal/arm/rte_power_intrinsics.c | 11 +++
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 35 +++++++++
lib/eal/ppc/rte_power_intrinsics.c | 11 +++
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++
7 files changed, 137 insertions(+)
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
/**< indicates support for rte_power_monitor function */
uint32_t power_pause : 1;
/**< indicates support for rte_power_pause function */
+ uint32_t power_monitor_multi : 1;
+ /**< indicates support for rte_power_monitor_multi function */
};
/**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
__rte_experimental
int rte_power_pause(const uint64_t tsc_timestamp);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each monitoring condition provides a callback (`fn`) and
+ * opaque callback data. The current value at each monitored address is read
+ * and passed to the condition's callback, and if any callback indicates that
+ * the wait should be aborted, the entering of optimized power state is
+ * skipped.
+ *
+ * @warning It is the responsibility of the user to check if this function is
+ * supported at runtime using the `rte_cpu_get_intrinsics_support()` API call.
+ * Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ * An array of monitoring condition structures.
+ * @param num
+ * Length of the `pmc` array.
+ * @param tsc_timestamp
+ * Maximum TSC timestamp to wait for. Note that the wait behavior is
+ * architecture-dependent.
+ *
+ * @return
+ * 0 on success
+ * -EINVAL on invalid parameters
+ * -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp);
+
#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 2df65c6903..887012d02a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
rte_version_release; # WINDOWS_NO_EXPORT
rte_version_suffix; # WINDOWS_NO_EXPORT
rte_version_year; # WINDOWS_NO_EXPORT
+
+ # added in 21.08
+ rte_power_monitor_multi; # WINDOWS_NO_EXPORT
};
INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
intrinsics->power_monitor = 1;
intrinsics->power_pause = 1;
+ if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+ intrinsics->power_monitor_multi = 1;
}
}
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
#include <rte_common.h>
#include <rte_lcore.h>
+#include <rte_rtm.h>
#include <rte_spinlock.h>
#include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
}
static bool wait_supported;
+static bool wait_multi_supported;
static inline uint64_t
__get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
if (i.power_monitor && i.power_pause)
wait_supported = 1;
+ if (i.power_monitor_multi)
+ wait_multi_supported = 1;
}
int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
* In this case, since we've already woken up, the "wakeup" was
* unneeded, and since T1 is still waiting on T2 releasing the lock, the
* wakeup address is still valid so it's perfectly safe to write it.
+ *
+ * For multi-monitor case, the act of locking will in itself trigger the
+ * wakeup, so no additional writes are necessary.
*/
rte_spinlock_lock(&s->lock);
if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return 0;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ const unsigned int lcore_id = rte_lcore_id();
+ struct power_wait_status *s = &wait_status[lcore_id];
+ uint32_t i, rc;
+
+ /* check if supported */
+ if (!wait_multi_supported)
+ return -ENOTSUP;
+
+ if (pmc == NULL || num == 0)
+ return -EINVAL;
+
+ /* we are already inside transaction region, return */
+ if (rte_xtest() != 0)
+ return 0;
+
+ /* start new transaction region */
+ rc = rte_xbegin();
+
+ /* transaction abort, possible write to one of wait addresses */
+ if (rc != RTE_XBEGIN_STARTED)
+ return 0;
+
+ /*
+ * the mere act of reading the lock status here adds the lock to
+ * the read set. This means that when we trigger a wakeup from another
+ * thread, even if we don't have a defined wakeup address and thus don't
+ * actually cause any writes, the act of locking our lock will itself
+ * trigger the wakeup and abort the transaction.
+ */
+ rte_spinlock_is_locked(&s->lock);
+
+ /*
+ * add all addresses to wait on into transaction read-set and check if
+ * any of wakeup conditions are already met.
+ */
+ rc = 0;
+ for (i = 0; i < num; i++) {
+ const struct rte_power_monitor_cond *c = &pmc[i];
+
+ /* cannot be NULL */
+ if (c->fn == NULL) {
+ rc = -EINVAL;
+ break;
+ }
+
+ const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+ /* abort if callback indicates that we need to stop */
+ if (c->fn(val, c->opaque) != 0)
+ break;
+ }
+
+ /* none of the conditions were met, sleep until timeout */
+ if (i == num)
+ rte_power_pause(tsc_timestamp);
+
+ /* end transaction region */
+ rte_xend();
+
+ return rc;
+}
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v9 5/8] power: remove thread safety from PMD power API's
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
` (3 preceding siblings ...)
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 4/8] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 6/8] power: support callbacks for multiple Rx queues Anatoly Burakov
` (4 subsequent siblings)
9 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.
We could have used something like RCU for this use case, but absent a
pressing need for thread safety, we'll take the easy way and simply
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v2:
- Add check for stopped queue
- Clarified doc message
- Added release notes
doc/guides/rel_notes/release_21_08.rst | 4 +
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 133 ++++++++++---------------
lib/power/rte_power_pmd_mgmt.h | 6 ++
4 files changed, 66 insertions(+), 80 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 912fb13b84..b9a3caabf0 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -146,6 +146,10 @@ API Changes
* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+* rte_power: The experimental PMD power management API is no longer considered
+ to be thread safe; all Rx queues affected by the API will now need to be
+ stopped before making any changes to the power management scheme.
+
ABI Changes
-----------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index 36e5a65874..bf937acde4 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -22,4 +22,7 @@ headers = files(
'rte_power_pmd_mgmt.h',
'rte_power_guest_channel.h',
)
+if cc.has_argument('-Wno-cast-qual')
+ cflags += '-Wno-cast-qual'
+endif
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
/**< Callback mode for this queue */
const struct rte_eth_rxtx_callback *cur_cb;
/**< Callback instance */
- volatile bool umwait_in_progress;
- /**< are we currently sleeping? */
uint64_t empty_poll_stats;
/**< Number of empty polls */
} __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
struct rte_power_monitor_cond pmc;
uint16_t ret;
- /*
- * we might get a cancellation request while being
- * inside the callback, in which case the wakeup
- * wouldn't work because it would've arrived too early.
- *
- * to get around this, we notify the other thread that
- * we're sleeping, so that it can spin until we're done.
- * unsolicited wakeups are perfectly safe.
- */
- q_conf->umwait_in_progress = true;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- /* check if we need to cancel sleep */
- if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
- /* use monitoring condition to sleep */
- ret = rte_eth_get_monitor_addr(port_id, qidx,
- &pmc);
- if (ret == 0)
- rte_power_monitor(&pmc, UINT64_MAX);
- }
- q_conf->umwait_in_progress = false;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+ /* use monitoring condition to sleep */
+ ret = rte_eth_get_monitor_addr(port_id, qidx,
+ &pmc);
+ if (ret == 0)
+ rte_power_monitor(&pmc, UINT64_MAX);
}
} else
q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
return nb_rx;
}
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+ struct rte_eth_rxq_info qinfo;
+
+ if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+ return -1;
+
+ return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
struct pmd_queue_cfg *queue_cfg;
struct rte_eth_dev_info info;
+ rte_rx_callback_fn clb;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
queue_cfg = &port_cfg[port_id][queue_id];
if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->umwait_in_progress = false;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* ensure we update our state before callback starts */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_umwait, NULL);
+ clb = clb_umwait;
break;
}
case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
- queue_id, clb_scale_freq, NULL);
+ clb = clb_scale_freq;
break;
}
case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (global_data.tsc_per_us == 0)
calc_tsc();
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_pause, NULL);
+ clb = clb_pause;
break;
+ default:
+ RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+ ret = -EINVAL;
+ goto end;
}
+
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, NULL);
+
ret = 0;
end:
return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
struct pmd_queue_cfg *queue_cfg;
+ int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
return -EINVAL;
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
/* no need to check queue id as wrong queue id would not be enabled */
queue_cfg = &port_cfg[port_id][queue_id];
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
/* stop any callbacks from progressing */
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
- /* ensure we update our state before continuing */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
switch (queue_cfg->cb_mode) {
- case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- bool exit = false;
- do {
- /*
- * we may request cancellation while the other thread
- * has just entered the callback but hasn't started
- * sleeping yet, so keep waking it up until we know it's
- * done sleeping.
- */
- if (queue_cfg->umwait_in_progress)
- rte_power_monitor_wakeup(lcore_id);
- else
- exit = true;
- } while (!exit);
- }
- /* fall-through */
+ case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
rte_eth_remove_rx_callback(port_id, queue_id,
queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
break;
}
/*
- * we don't free the RX callback here because it is unsafe to do so
- * unless we know for a fact that all data plane threads have stopped.
+ * the API doc mandates that the user stops all processing on affected
+ * ports before calling any of these API's, so we can assume that the
+ * callbacks can be freed. we're intentionally casting away const-ness.
*/
- queue_cfg->cur_cb = NULL;
+ rte_free((void *)queue_cfg->cur_cb);
return 0;
}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue will be polled from.
* @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue is polled from.
* @param port_id
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
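The stopped-queue precondition added in this patch follows a small tri-state pattern: `queue_stopped()` returns a negative value when the queue cannot be queried, 0 when it is still running, and 1 when it is stopped, and the enable/disable paths map the first two cases onto `-EINVAL` and `-EBUSY`. A standalone sketch of that mapping (the names here are stand-ins, not the DPDK symbols):

```c
#include <assert.h>
#include <errno.h>

/* possible results of querying a queue's state, mirroring the patch's
 * queue_stopped() helper built on rte_eth_rx_queue_info_get() */
enum queue_state {
	QUEUE_INVALID = -1,   /* queue could not be queried */
	QUEUE_RUNNING = 0,    /* queue is started */
	QUEUE_STOPPED = 1     /* queue is stopped */
};

/* replicate the gating logic at the top of
 * rte_power_ethdev_pmgmt_queue_enable()/_disable(): only a stopped
 * queue may have its power management configuration changed. */
int
pmgmt_configure(enum queue_state st)
{
	int ret = (int)st;

	if (ret != 1) {
		/* error means invalid queue, 0 means queue wasn't stopped */
		return ret < 0 ? -EINVAL : -EBUSY;
	}
	return 0;   /* safe to (un)install the Rx callback */
}
```

In application terms, the expected sequence is therefore: stop the Rx queue (e.g. with `rte_eth_dev_rx_queue_stop()`), call the pmgmt enable/disable API, then restart the queue with `rte_eth_dev_rx_queue_start()`.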
* [dpdk-dev] [PATCH v9 6/8] power: support callbacks for multiple Rx queues
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
` (4 preceding siblings ...)
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 5/8] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 7/8] power: support monitoring " Anatoly Burakov
` (3 subsequent siblings)
9 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, the PMD power management support has a hard limitation:
it allows only a single queue per lcore. This is not ideal, as most
DPDK use cases will poll multiple queues per core.
The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:
- Replace per-queue structures with per-lcore ones, so that any device
polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
added to the list of queues to poll, so that the callback is aware of
other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
shared between all queues polled on a particular lcore, and are only
activated when all queues in the list were polled and were determined
to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
is incapable of monitoring more than one address.
Also, while we're at it, update and improve the docs.
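The per-lcore bookkeeping described in the bullets above can be simulated in isolation. The sketch below mirrors the patch's `queue_can_sleep()`/`lcore_can_sleep()`/`queue_reset()` logic with a toy `EMPTYPOLL_MAX` of 2 and an initial `sleep_target` of 1 (a nonzero initial target is assumed here, in line with the v7 note about fixing an initial sleep target of zero): the lcore sleeps only once every queue it polls has exceeded the empty-poll threshold, and traffic on any single queue prevents the sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define EMPTYPOLL_MAX 2   /* toy threshold; the library's value is larger */

struct queue_state {
	uint64_t n_empty_polls;
	uint64_t n_sleeps;
};

struct lcore_state {
	size_t n_queues;
	uint64_t n_queues_ready_to_sleep;
	uint64_t sleep_target;
};

/* called on an empty poll: has this queue earned the right to sleep? */
static bool
queue_can_sleep(struct lcore_state *cfg, struct queue_state *q)
{
	q->n_empty_polls++;
	if (q->n_empty_polls <= EMPTYPOLL_MAX)
		return false;
	/* already marked ready for the current sleep iteration */
	if (q->n_sleeps == cfg->sleep_target)
		return true;
	/* mark this queue as ready for sleep */
	q->n_sleeps = cfg->sleep_target;
	cfg->n_queues_ready_to_sleep++;
	return true;
}

/* the lcore may sleep only when every queue it polls is ready */
static bool
lcore_can_sleep(struct lcore_state *cfg)
{
	if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
		return false;
	/* start the next sleep iteration; empty-poll counters are kept so
	 * the lcore keeps sleeping on every poll until traffic arrives */
	cfg->n_queues_ready_to_sleep = 0;
	cfg->sleep_target++;
	return true;
}

/* called when a queue sees traffic: take it out of the ready set */
static void
queue_reset(struct lcore_state *cfg, struct queue_state *q)
{
	const bool was_ready = q->n_sleeps == cfg->sleep_target;

	q->n_empty_polls = 0;
	q->n_sleeps = 0;
	if (was_ready)
		cfg->n_queues_ready_to_sleep--;
}

/* poll two queues for 'rounds' rounds; queue 0 sees traffic in round
 * 'traffic_round' (1-based, 0 = never). Returns how often the lcore slept. */
int
simulate(int rounds, int traffic_round)
{
	struct lcore_state cfg = { .n_queues = 2, .sleep_target = 1 };
	struct queue_state q[2] = { {0, 0}, {0, 0} };
	int sleeps = 0;
	int r, i;

	for (r = 1; r <= rounds; r++) {
		for (i = 0; i < 2; i++) {
			const bool empty = !(i == 0 && r == traffic_round);

			if (!empty)
				queue_reset(&cfg, &q[i]);
			else if (queue_can_sleep(&cfg, &q[i]) &&
					lcore_can_sleep(&cfg))
				sleeps++;
		}
	}
	return sleeps;
}
```

With the threshold at 2, the lcore's first sleep happens in round 3 (the first round in which both queues have exceeded the threshold), and it then sleeps every round until one of the queues sees traffic.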
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v8:
- Added a comment explaining that we want to sleep on each empty poll after
threshold has been reached
v7:
- Fix bug where initial sleep target was always set to zero
- Fix logic in handling of n_queues_ready_to_sleep counter
- Update documentation on hardware requirements
v6:
- Track each individual queue sleep status (Konstantin)
- Fix segfault (Dave)
v5:
- Remove the "power save queue" API and replace it with mechanism suggested by
Konstantin
v3:
- Move the list of supported NICs to NIC feature table
v2:
- Use a TAILQ for queues instead of a static array
- Address feedback from Konstantin
- Add additional checks for stopped queues
doc/guides/prog_guide/power_man.rst | 69 ++--
doc/guides/rel_notes/release_21_08.rst | 5 +
lib/power/rte_power_pmd_mgmt.c | 460 +++++++++++++++++++------
3 files changed, 398 insertions(+), 136 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..0e66878892 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,45 @@ Ethernet PMD Power Management API
Abstract
~~~~~~~~
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
- This power saving scheme will put the CPU into optimized power state
- and use the ``rte_power_monitor()`` function
- to monitor the Ethernet PMD RX descriptor address,
- and wake the CPU up whenever there's new traffic.
-
-Pause
- This power saving scheme will avoid busy polling
- by either entering power-optimized sleep state
- with ``rte_power_pause()`` function,
- or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
- This power saving scheme will use ``librte_power`` library
- functionality to scale the core frequency up/down
- depending on traffic volume.
-
-.. note::
-
- Currently, this power management API is limited to mandatory mapping
- of 1 queue to 1 core (multiple queues are supported,
- but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+ This power saving scheme will put the CPU into optimized power state and
+ monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+ there's new traffic. Support for this scheme may not be available on all
+ platforms, and further limitations may apply (see below).
+
+* Pause
+ This power saving scheme will avoid busy polling by either entering
+ power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+ not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+ This power saving scheme will use ``librte_power`` library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* On Linux* x86_64, ``rte_power_monitor()`` requires the WAITPKG instruction set
+ to be supported by the CPU. Please refer to your platform documentation for
+ further information.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+ limited to a 1 core to 1 queue mapping (thus, each Rx queue will have to be
+ monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+ ``rte_power_monitor()`` function is not supported, then monitor mode will not
+ be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+ platform may support the necessary CPU instructions. Please refer to
+ :doc:`../nics/overview` for more information.
+
API Overview for Ethernet PMD Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +253,5 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index b9a3caabf0..ca28ebe461 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -112,6 +112,11 @@ New Features
Added support for cppc_cpufreq driver which works on most arm64 platforms.
+* **Added multi-queue support to Ethernet PMD Power Management**
+
+ The experimental PMD power management API now supports managing
+ multiple Ethernet Rx queues per lcore.
+
Removed Items
-------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..30772791af 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,98 @@ enum pmd_mgmt_state {
PMD_MGMT_ENABLED
};
-struct pmd_queue_cfg {
+union queue {
+ uint32_t val;
+ struct {
+ uint16_t portid;
+ uint16_t qid;
+ };
+};
+
+struct queue_list_entry {
+ TAILQ_ENTRY(queue_list_entry) next;
+ union queue queue;
+ uint64_t n_empty_polls;
+ uint64_t n_sleeps;
+ const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+ TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+ /**< List of queues associated with this lcore */
+ size_t n_queues;
+ /**< How many queues are in the list? */
volatile enum pmd_mgmt_state pwr_mgmt_state;
/**< State of power management for this queue */
enum rte_power_pmd_mgmt_type cb_mode;
/**< Callback mode for this queue */
- const struct rte_eth_rxtx_callback *cur_cb;
- /**< Callback instance */
- uint64_t empty_poll_stats;
- /**< Number of empty polls */
+ uint64_t n_queues_ready_to_sleep;
+ /**< Number of queues ready to enter power optimized state */
+ uint64_t sleep_target;
+ /**< Prevent a queue from triggering sleep multiple times */
} __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+ return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+ dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *cur;
+
+ TAILQ_FOREACH(cur, &cfg->head, next) {
+ if (queue_equal(&cur->queue, q))
+ return cur;
+ }
+ return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *qle;
+
+ /* is it already in the list? */
+ if (queue_list_find(cfg, q) != NULL)
+ return -EEXIST;
+
+ qle = malloc(sizeof(*qle));
+ if (qle == NULL)
+ return -ENOMEM;
+ memset(qle, 0, sizeof(*qle));
+
+ queue_copy(&qle->queue, q);
+ TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+ cfg->n_queues++;
+
+ return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *found;
+
+ found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return NULL;
+
+ TAILQ_REMOVE(&cfg->head, found, next);
+ cfg->n_queues--;
+
+ /* freeing is responsibility of the caller */
+ return found;
+}
static void
calc_tsc(void)
@@ -74,21 +154,79 @@ calc_tsc(void)
}
}
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ const bool is_ready_to_sleep = qcfg->n_sleeps == cfg->sleep_target;
+
+ /* reset empty poll counter for this queue */
+ qcfg->n_empty_polls = 0;
+ /* reset the queue sleep counter as well */
+ qcfg->n_sleeps = 0;
+ /* remove the queue from list of queues ready to sleep */
+ if (is_ready_to_sleep)
+ cfg->n_queues_ready_to_sleep--;
+ /*
+ * no need to change the lcore sleep target counter because this lcore will
+ * reach the n_sleeps anyway, and the other cores are already counted so
+ * there's no need to do anything else.
+ */
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ /* this function is called - that means we have an empty poll */
+ qcfg->n_empty_polls++;
+
+ /* if we haven't reached threshold for empty polls, we can't sleep */
+ if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+ return false;
+
+ /*
+ * we've reached a point where we are able to sleep, but we still need
+ * to check if this queue has already been marked for sleeping.
+ */
+ if (qcfg->n_sleeps == cfg->sleep_target)
+ return true;
+
+ /* mark this queue as ready for sleep */
+ qcfg->n_sleeps = cfg->sleep_target;
+ cfg->n_queues_ready_to_sleep++;
+
+ return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+ /* are all queues ready to sleep? */
+ if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+ return false;
+
+ /* we've reached an iteration where we can sleep, reset sleep counter */
+ cfg->n_queues_ready_to_sleep = 0;
+ cfg->sleep_target++;
+ /*
+ * we do not reset any individual queue empty poll counters, because
+ * we want to keep sleeping on every poll until we actually get traffic.
+ */
+
+ return true;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+ uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
{
+ struct queue_list_entry *queue_conf = arg;
- struct pmd_queue_cfg *q_conf;
-
- q_conf = &port_cfg[port_id][qidx];
-
+ /* this callback can't do more than one queue, omit multiqueue logic */
if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+ queue_conf->n_empty_polls++;
+ if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
struct rte_power_monitor_cond pmc;
- uint16_t ret;
+ int ret;
/* use monitoring condition to sleep */
ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +235,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
rte_power_monitor(&pmc, UINT64_MAX);
}
} else
- q_conf->empty_poll_stats = 0;
+ queue_conf->n_empty_polls = 0;
return nb_rx;
}
static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
- q_conf = &port_cfg[port_id][qidx];
+ lcore_conf = &lcore_cfgs[lcore];
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- /* sleep for 1 microsecond */
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
- /* use tpause if we have it */
- if (global_data.intrinsics_support.power_pause) {
- const uint64_t cur = rte_rdtsc();
- const uint64_t wait_tsc =
- cur + global_data.tsc_per_us;
- rte_power_pause(wait_tsc);
- } else {
- uint64_t i;
- for (i = 0; i < global_data.pause_per_us; i++)
- rte_pause();
- }
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* sleep for 1 microsecond, use tpause if we have it */
+ if (global_data.intrinsics_support.power_pause) {
+ const uint64_t cur = rte_rdtsc();
+ const uint64_t wait_tsc =
+ cur + global_data.tsc_per_us;
+ rte_power_pause(wait_tsc);
+ } else {
+ uint64_t i;
+ for (i = 0; i < global_data.pause_per_us; i++)
+ rte_pause();
}
- } else
- q_conf->empty_poll_stats = 0;
+ }
return nb_rx;
}
static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
- uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+ struct queue_list_entry *queue_conf = arg;
- q_conf = &port_cfg[port_id][qidx];
+ if (likely(!empty)) {
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
- /* scale down freq */
- rte_power_freq_min(rte_lcore_id());
- } else {
- q_conf->empty_poll_stats = 0;
- /* scale up freq */
+ /* scale up freq immediately */
rte_power_freq_max(rte_lcore_id());
+ } else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ rte_power_freq_min(rte_lcore_id());
}
return nb_rx;
@@ -167,11 +322,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
}
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+ const struct queue_list_entry *entry;
+
+ TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+ const union queue *q = &entry->queue;
+ int ret = queue_stopped(q->portid, q->qid);
+ if (ret != 1)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+ enum power_management_env env;
+
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ return -ENOTSUP;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore))
+ return -EINVAL;
+
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+ struct rte_power_monitor_cond dummy;
+
+ /* check if rte_power_monitor is supported */
+ if (!global_data.intrinsics_support.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ return -ENOTSUP;
+ }
+
+ if (cfg->n_queues > 0) {
+ RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+ return -ENOTSUP;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+ &dummy) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
struct rte_eth_dev_info info;
rte_rx_callback_fn clb;
int ret;
@@ -202,9 +426,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
+ /* if callback was already enabled, check current callback type */
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+ lcore_cfg->cb_mode != mode) {
ret = -EINVAL;
goto end;
}
@@ -214,53 +448,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
switch (mode) {
case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- struct rte_power_monitor_cond dummy;
-
- /* check if rte_power_monitor is supported */
- if (!global_data.intrinsics_support.power_monitor) {
- RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_monitor(lcore_cfg, &qdata);
+ if (ret < 0)
goto end;
- }
- /* check if the device supports the necessary PMD API */
- if (rte_eth_get_monitor_addr(port_id, queue_id,
- &dummy) == -ENOTSUP) {
- RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_umwait;
break;
- }
case RTE_POWER_MGMT_TYPE_SCALE:
- {
- enum power_management_env env;
- /* only PSTATE and ACPI modes are supported */
- if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
- !rte_power_check_env_supported(
- PM_ENV_PSTATE_CPUFREQ)) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_scale(lcore_id);
+ if (ret < 0)
goto end;
- }
- /* ensure we could initialize the power library */
- if (rte_power_init(lcore_id)) {
- ret = -EINVAL;
- goto end;
- }
- /* ensure we initialized the correct env */
- env = rte_power_get_env();
- if (env != PM_ENV_ACPI_CPUFREQ &&
- env != PM_ENV_PSTATE_CPUFREQ) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_scale_freq;
break;
- }
case RTE_POWER_MGMT_TYPE_PAUSE:
/* figure out various time-to-tsc conversions */
if (global_data.tsc_per_us == 0)
@@ -273,13 +474,27 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -EINVAL;
goto end;
}
+ /* add this queue to the list */
+ ret = queue_list_add(lcore_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+ strerror(-ret));
+ goto end;
+ }
+ /* new queue is always added last */
+ queue_cfg = TAILQ_LAST(&lcore_cfg->head, queue_list_head);
+
+ /* when enabling first queue, ensure sleep target is not 0 */
+ if (lcore_cfg->n_queues == 1 && lcore_cfg->sleep_target == 0)
+ lcore_cfg->sleep_target = 1;
/* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb, NULL);
+ if (lcore_cfg->n_queues == 1) {
+ lcore_cfg->cb_mode = mode;
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ }
+ queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, queue_cfg);
ret = 0;
end:
@@ -290,7 +505,9 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +523,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
}
/* no need to check queue id as wrong queue id would not be enabled */
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
return -EINVAL;
- /* stop any callbacks from progressing */
- queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ /*
+ * There is no good/easy way to do this without race conditions, so we
+ * are just going to throw our hands in the air and hope that the user
+ * has read the documentation and has ensured that ports are stopped at
+ * the time we enter the API functions.
+ */
+ queue_cfg = queue_list_take(lcore_cfg, &qdata);
+ if (queue_cfg == NULL)
+ return -ENOENT;
- switch (queue_cfg->cb_mode) {
+ /* if we've removed all queues from the lists, set state to disabled */
+ if (lcore_cfg->n_queues == 0)
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+ switch (lcore_cfg->cb_mode) {
case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
break;
case RTE_POWER_MGMT_TYPE_SCALE:
rte_power_freq_max(lcore_id);
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
rte_power_exit(lcore_id);
break;
}
@@ -332,7 +565,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
* ports before calling any of these API's, so we can assume that the
* callbacks can be freed. we're intentionally casting away const-ness.
*/
- rte_free((void *)queue_cfg->cur_cb);
+ rte_free((void *)queue_cfg->cb);
+ free(queue_cfg);
return 0;
}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+ size_t i;
+
+ /* initialize all tailqs */
+ for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+ struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+ TAILQ_INIT(&cfg->head);
+ }
+}
--
2.25.1
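As a side note, the TAILQ-based per-lcore queue list used in this patch can be modelled in isolation. The sketch below is a standalone approximation with simplified types: the names mirror the patch for readability only, and `calloc` stands in for the patch's `malloc` + `memset`; it is not DPDK's real implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Simplified stand-ins for the patch's types; illustrative only. */
union queue {
	uint32_t val;
	struct {
		uint16_t portid;
		uint16_t qid;
	};
};

struct queue_list_entry {
	TAILQ_ENTRY(queue_list_entry) next;
	union queue queue;
};

struct core_cfg {
	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
	size_t n_queues;
};

static struct queue_list_entry *
queue_list_find(const struct core_cfg *cfg, const union queue *q)
{
	struct queue_list_entry *cur;

	TAILQ_FOREACH(cur, &cfg->head, next) {
		if (cur->queue.val == q->val)
			return cur;
	}
	return NULL;
}

static int
queue_list_add(struct core_cfg *cfg, const union queue *q)
{
	struct queue_list_entry *qle;

	/* each (port, queue) pair may only be tracked once per lcore */
	if (queue_list_find(cfg, q) != NULL)
		return -EEXIST;

	qle = calloc(1, sizeof(*qle));
	if (qle == NULL)
		return -ENOMEM;

	qle->queue = *q;
	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
	cfg->n_queues++;
	return 0;
}

static struct queue_list_entry *
queue_list_take(struct core_cfg *cfg, const union queue *q)
{
	struct queue_list_entry *found = queue_list_find(cfg, q);

	if (found == NULL)
		return NULL;

	TAILQ_REMOVE(&cfg->head, found, next);
	cfg->n_queues--;
	/* freeing is the caller's responsibility, as in the patch */
	return found;
}
```

Rejecting duplicates with `-EEXIST` here corresponds to the patch's behavior when the same queue is enabled twice on one lcore.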
* [dpdk-dev] [PATCH v9 6/8] power: support callbacks for multiple Rx queues
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 6/8] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-09 16:00 ` Anatoly Burakov
0 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:00 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, there is a hard limitation in the PMD power management
support that only allows a single queue per lcore. This is not ideal, as
most DPDK use cases will poll multiple queues per core.
The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:
- Replace per-queue structures with per-lcore ones, so that any device
polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
added to the list of queues to poll, so that the callback is aware of
other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
shared between all queues polled on a particular lcore, and are only
activated when all queues in the list were polled and were determined
to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
is incapable of monitoring more than one address.
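The gating described in the bullets above can be sketched outside DPDK. The model below reuses the patch's field names (`n_empty_polls`, `n_sleeps`, `sleep_target`, `n_queues_ready_to_sleep`) but is a simplified standalone approximation, and the `EMPTYPOLL_MAX` value is an assumed placeholder rather than DPDK's internal constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* assumed threshold; DPDK defines its own EMPTYPOLL_MAX internally */
#define EMPTYPOLL_MAX 512

/* per-queue and per-lcore counters, modelled on the patch */
struct queue_state {
	uint64_t n_empty_polls;
	uint64_t n_sleeps;
};

struct lcore_state {
	size_t n_queues;
	uint64_t n_queues_ready_to_sleep;
	uint64_t sleep_target;
};

/* called on every empty poll of one queue; true once the queue is
 * past the empty-poll threshold and counted against the current target */
static bool
queue_can_sleep(struct lcore_state *lc, struct queue_state *q)
{
	q->n_empty_polls++;
	if (q->n_empty_polls <= EMPTYPOLL_MAX)
		return false;
	if (q->n_sleeps == lc->sleep_target)
		return true; /* already counted for this sleep round */
	q->n_sleeps = lc->sleep_target;
	lc->n_queues_ready_to_sleep++;
	return true;
}

/* the lcore may only sleep once every queue it polls is ready;
 * bumping sleep_target starts the next round of bookkeeping */
static bool
lcore_can_sleep(struct lcore_state *lc)
{
	if (lc->n_queues_ready_to_sleep != lc->n_queues)
		return false;
	lc->n_queues_ready_to_sleep = 0;
	lc->sleep_target++;
	return true;
}
```

With two queues on one lcore, `lcore_can_sleep()` stays false until both queues have individually crossed the empty-poll threshold, which is the core of the multi-queue behavior this patch adds.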
Also, while we're at it, update and improve the docs.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v8:
- Added a comment explaining that we want to sleep on each empty poll after
threshold has been reached
v7:
- Fix bug where initial sleep target was always set to zero
- Fix logic in handling of n_queues_ready_to_sleep counter
- Update documentation on hardware requirements
v6:
- Track each individual queue sleep status (Konstantin)
- Fix segfault (Dave)
v5:
- Remove the "power save queue" API and replace it with a mechanism suggested
by Konstantin
v3:
- Move the list of supported NICs to NIC feature table
v2:
- Use a TAILQ for queues instead of a static array
- Address feedback from Konstantin
- Add additional checks for stopped queues
doc/guides/prog_guide/power_man.rst | 69 ++--
doc/guides/rel_notes/release_21_08.rst | 5 +
lib/power/rte_power_pmd_mgmt.c | 460 +++++++++++++++++++------
3 files changed, 398 insertions(+), 136 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..0e66878892 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,45 @@ Ethernet PMD Power Management API
Abstract
~~~~~~~~
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
- This power saving scheme will put the CPU into optimized power state
- and use the ``rte_power_monitor()`` function
- to monitor the Ethernet PMD RX descriptor address,
- and wake the CPU up whenever there's new traffic.
-
-Pause
- This power saving scheme will avoid busy polling
- by either entering power-optimized sleep state
- with ``rte_power_pause()`` function,
- or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
- This power saving scheme will use ``librte_power`` library
- functionality to scale the core frequency up/down
- depending on traffic volume.
-
-.. note::
-
- Currently, this power management API is limited to mandatory mapping
- of 1 queue to 1 core (multiple queues are supported,
- but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides
+a convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever the empty poll count reaches a certain number.
+
+* Monitor
+ This power saving scheme will put the CPU into optimized power state and
+ monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+ there's new traffic. Support for this scheme may not be available on all
+ platforms, and further limitations may apply (see below).
+
+* Pause
+ This power saving scheme will avoid busy polling by either entering
+ power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+ not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+ This power saving scheme will use ``librte_power`` library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* On Linux* x86_64, ``rte_power_monitor()`` requires the WAITPKG instruction
+ set to be supported by the CPU. Please refer to your platform documentation
+ for further information.
+
+* If the ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+ limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to
+ be monitored from a different lcore).
+
+* If the ``rte_cpu_get_intrinsics_support()`` function indicates that the
+ ``rte_power_monitor()`` function is not supported, then monitor mode will
+ not be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+ platform may support the necessary CPU instructions. Please refer to
+ :doc:`../nics/overview` for more information.
+
API Overview for Ethernet PMD Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +253,5 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index b9a3caabf0..ca28ebe461 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -112,6 +112,11 @@ New Features
Added support for cppc_cpufreq driver which works on most arm64 platforms.
+* **Added multi-queue support to Ethernet PMD Power Management**
+
+ The experimental PMD power management API now supports managing
+ multiple Ethernet Rx queues per lcore.
+
Removed Items
-------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..30772791af 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,98 @@ enum pmd_mgmt_state {
PMD_MGMT_ENABLED
};
-struct pmd_queue_cfg {
+union queue {
+ uint32_t val;
+ struct {
+ uint16_t portid;
+ uint16_t qid;
+ };
+};
+
+struct queue_list_entry {
+ TAILQ_ENTRY(queue_list_entry) next;
+ union queue queue;
+ uint64_t n_empty_polls;
+ uint64_t n_sleeps;
+ const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+ TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+ /**< List of queues associated with this lcore */
+ size_t n_queues;
+ /**< How many queues are in the list? */
volatile enum pmd_mgmt_state pwr_mgmt_state;
/**< State of power management for this queue */
enum rte_power_pmd_mgmt_type cb_mode;
/**< Callback mode for this queue */
- const struct rte_eth_rxtx_callback *cur_cb;
- /**< Callback instance */
- uint64_t empty_poll_stats;
- /**< Number of empty polls */
+ uint64_t n_queues_ready_to_sleep;
+ /**< Number of queues ready to enter power optimized state */
+ uint64_t sleep_target;
+ /**< Prevent a queue from triggering sleep multiple times */
} __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+ return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+ dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *cur;
+
+ TAILQ_FOREACH(cur, &cfg->head, next) {
+ if (queue_equal(&cur->queue, q))
+ return cur;
+ }
+ return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *qle;
+
+ /* is it already in the list? */
+ if (queue_list_find(cfg, q) != NULL)
+ return -EEXIST;
+
+ qle = malloc(sizeof(*qle));
+ if (qle == NULL)
+ return -ENOMEM;
+ memset(qle, 0, sizeof(*qle));
+
+ queue_copy(&qle->queue, q);
+ TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+ cfg->n_queues++;
+
+ return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *found;
+
+ found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return NULL;
+
+ TAILQ_REMOVE(&cfg->head, found, next);
+ cfg->n_queues--;
+
+ /* freeing is responsibility of the caller */
+ return found;
+}
static void
calc_tsc(void)
@@ -74,21 +154,79 @@ calc_tsc(void)
}
}
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ const bool is_ready_to_sleep = qcfg->n_sleeps == cfg->sleep_target;
+
+ /* reset empty poll counter for this queue */
+ qcfg->n_empty_polls = 0;
+ /* reset the queue sleep counter as well */
+ qcfg->n_sleeps = 0;
+ /* remove the queue from list of queues ready to sleep */
+ if (is_ready_to_sleep)
+ cfg->n_queues_ready_to_sleep--;
+ /*
+ * no need to change the lcore sleep target counter because this lcore will
+ * reach the n_sleeps anyway, and the other cores are already counted so
+ * there's no need to do anything else.
+ */
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ /* this function is called - that means we have an empty poll */
+ qcfg->n_empty_polls++;
+
+ /* if we haven't reached threshold for empty polls, we can't sleep */
+ if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+ return false;
+
+ /*
+ * we've reached a point where we are able to sleep, but we still need
+ * to check if this queue has already been marked for sleeping.
+ */
+ if (qcfg->n_sleeps == cfg->sleep_target)
+ return true;
+
+ /* mark this queue as ready for sleep */
+ qcfg->n_sleeps = cfg->sleep_target;
+ cfg->n_queues_ready_to_sleep++;
+
+ return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+ /* are all queues ready to sleep? */
+ if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+ return false;
+
+ /* we've reached an iteration where we can sleep, reset sleep counter */
+ cfg->n_queues_ready_to_sleep = 0;
+ cfg->sleep_target++;
+ /*
+ * we do not reset any individual queue empty poll counters, because
+ * we want to keep sleeping on every poll until we actually get traffic.
+ */
+
+ return true;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+ uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
{
+ struct queue_list_entry *queue_conf = arg;
- struct pmd_queue_cfg *q_conf;
-
- q_conf = &port_cfg[port_id][qidx];
-
+ /* this callback can't do more than one queue, omit multiqueue logic */
if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+ queue_conf->n_empty_polls++;
+ if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
struct rte_power_monitor_cond pmc;
- uint16_t ret;
+ int ret;
/* use monitoring condition to sleep */
ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +235,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
rte_power_monitor(&pmc, UINT64_MAX);
}
} else
- q_conf->empty_poll_stats = 0;
+ queue_conf->n_empty_polls = 0;
return nb_rx;
}
static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
- q_conf = &port_cfg[port_id][qidx];
+ lcore_conf = &lcore_cfgs[lcore];
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- /* sleep for 1 microsecond */
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
- /* use tpause if we have it */
- if (global_data.intrinsics_support.power_pause) {
- const uint64_t cur = rte_rdtsc();
- const uint64_t wait_tsc =
- cur + global_data.tsc_per_us;
- rte_power_pause(wait_tsc);
- } else {
- uint64_t i;
- for (i = 0; i < global_data.pause_per_us; i++)
- rte_pause();
- }
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* sleep for 1 microsecond, use tpause if we have it */
+ if (global_data.intrinsics_support.power_pause) {
+ const uint64_t cur = rte_rdtsc();
+ const uint64_t wait_tsc =
+ cur + global_data.tsc_per_us;
+ rte_power_pause(wait_tsc);
+ } else {
+ uint64_t i;
+ for (i = 0; i < global_data.pause_per_us; i++)
+ rte_pause();
}
- } else
- q_conf->empty_poll_stats = 0;
+ }
return nb_rx;
}
static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
- uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+ struct queue_list_entry *queue_conf = arg;
- q_conf = &port_cfg[port_id][qidx];
+ if (likely(!empty)) {
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
- /* scale down freq */
- rte_power_freq_min(rte_lcore_id());
- } else {
- q_conf->empty_poll_stats = 0;
- /* scale up freq */
+ /* scale up freq immediately */
rte_power_freq_max(rte_lcore_id());
+ } else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ rte_power_freq_min(rte_lcore_id());
}
return nb_rx;
@@ -167,11 +322,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
}
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+ const struct queue_list_entry *entry;
+
+ TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+ const union queue *q = &entry->queue;
+ int ret = queue_stopped(q->portid, q->qid);
+ if (ret != 1)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+ enum power_management_env env;
+
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ return -ENOTSUP;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore))
+ return -EINVAL;
+
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+ struct rte_power_monitor_cond dummy;
+
+ /* check if rte_power_monitor is supported */
+ if (!global_data.intrinsics_support.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ return -ENOTSUP;
+ }
+
+ if (cfg->n_queues > 0) {
+ RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+ return -ENOTSUP;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+ &dummy) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
struct rte_eth_dev_info info;
rte_rx_callback_fn clb;
int ret;
@@ -202,9 +426,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
+ /* if callback was already enabled, check current callback type */
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+ lcore_cfg->cb_mode != mode) {
ret = -EINVAL;
goto end;
}
@@ -214,53 +448,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
switch (mode) {
case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- struct rte_power_monitor_cond dummy;
-
- /* check if rte_power_monitor is supported */
- if (!global_data.intrinsics_support.power_monitor) {
- RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_monitor(lcore_cfg, &qdata);
+ if (ret < 0)
goto end;
- }
- /* check if the device supports the necessary PMD API */
- if (rte_eth_get_monitor_addr(port_id, queue_id,
- &dummy) == -ENOTSUP) {
- RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_umwait;
break;
- }
case RTE_POWER_MGMT_TYPE_SCALE:
- {
- enum power_management_env env;
- /* only PSTATE and ACPI modes are supported */
- if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
- !rte_power_check_env_supported(
- PM_ENV_PSTATE_CPUFREQ)) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_scale(lcore_id);
+ if (ret < 0)
goto end;
- }
- /* ensure we could initialize the power library */
- if (rte_power_init(lcore_id)) {
- ret = -EINVAL;
- goto end;
- }
- /* ensure we initialized the correct env */
- env = rte_power_get_env();
- if (env != PM_ENV_ACPI_CPUFREQ &&
- env != PM_ENV_PSTATE_CPUFREQ) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_scale_freq;
break;
- }
case RTE_POWER_MGMT_TYPE_PAUSE:
/* figure out various time-to-tsc conversions */
if (global_data.tsc_per_us == 0)
@@ -273,13 +474,27 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -EINVAL;
goto end;
}
+ /* add this queue to the list */
+ ret = queue_list_add(lcore_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+ strerror(-ret));
+ goto end;
+ }
+ /* new queue is always added last */
+ queue_cfg = TAILQ_LAST(&lcore_cfg->head, queue_list_head);
+
+ /* when enabling first queue, ensure sleep target is not 0 */
+ if (lcore_cfg->n_queues == 1 && lcore_cfg->sleep_target == 0)
+ lcore_cfg->sleep_target = 1;
/* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb, NULL);
+ if (lcore_cfg->n_queues == 1) {
+ lcore_cfg->cb_mode = mode;
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ }
+ queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, queue_cfg);
ret = 0;
end:
@@ -290,7 +505,9 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +523,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
}
/* no need to check queue id as wrong queue id would not be enabled */
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
return -EINVAL;
- /* stop any callbacks from progressing */
- queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ /*
+ * There is no good/easy way to do this without race conditions, so we
+ * are just going to throw our hands in the air and hope that the user
+ * has read the documentation and has ensured that ports are stopped at
+ * the time we enter the API functions.
+ */
+ queue_cfg = queue_list_take(lcore_cfg, &qdata);
+ if (queue_cfg == NULL)
+ return -ENOENT;
- switch (queue_cfg->cb_mode) {
+ /* if we've removed all queues from the lists, set state to disabled */
+ if (lcore_cfg->n_queues == 0)
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+ switch (lcore_cfg->cb_mode) {
case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
break;
case RTE_POWER_MGMT_TYPE_SCALE:
rte_power_freq_max(lcore_id);
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
rte_power_exit(lcore_id);
break;
}
@@ -332,7 +565,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
* ports before calling any of these API's, so we can assume that the
* callbacks can be freed. we're intentionally casting away const-ness.
*/
- rte_free((void *)queue_cfg->cur_cb);
+ rte_free((void *)queue_cfg->cb);
+ free(queue_cfg);
return 0;
}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+ size_t i;
+
+ /* initialize all tailqs */
+ for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+ struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+ TAILQ_INIT(&cfg->head);
+ }
+}
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
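The per-lcore bookkeeping in the patch above (queue_list_add, TAILQ_LAST, and the RTE_INIT tailq initialization) can be sketched in isolation. This is a minimal stand-in, not the DPDK code: the type and function names are illustrative, and the real implementation carries additional state (callback pointers, sleep targets, power-management mode).

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for the patch's per-lcore queue bookkeeping. */
struct queue_entry {
	int portid, qid;
	TAILQ_ENTRY(queue_entry) next;
};

TAILQ_HEAD(queue_list_head, queue_entry);

struct core_cfg {
	struct queue_list_head head;
	unsigned int n_queues;
};

/* What RTE_INIT(rte_power_ethdev_pmgmt_init) does for every lcore slot. */
static void
core_cfg_init(struct core_cfg *cfg)
{
	TAILQ_INIT(&cfg->head);
	cfg->n_queues = 0;
}

/* Append a queue; the patch relies on "new queue is always added last",
 * so the most recently enabled queue is found via TAILQ_LAST. */
static int
queue_list_add(struct core_cfg *cfg, int portid, int qid)
{
	struct queue_entry *qle = calloc(1, sizeof(*qle));

	if (qle == NULL)
		return -1;
	qle->portid = portid;
	qle->qid = qid;
	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
	cfg->n_queues++;
	return 0;
}
```

After two adds, `TAILQ_LAST(&cfg.head, queue_list_head)` yields the second queue, which is exactly how the enable path retrieves the entry it just inserted.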
* [dpdk-dev] [PATCH v9 7/8] power: support monitoring multiple Rx queues
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
` (5 preceding siblings ...)
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 6/8] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 8/8] examples/l3fwd-power: support multiq in PMD modes Anatoly Burakov
` (2 subsequent siblings)
9 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v6:
- Fix the missed feedback from v5
v4:
- Fix possible out of bounds access
- Added missing index increment
doc/guides/prog_guide/power_man.rst | 15 ++++--
lib/power/rte_power_pmd_mgmt.c | 82 ++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 7 deletions(-)
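The fallback described in the commit message — use the multi-monitor callback unconditionally when supported, otherwise fall back to the UMWAIT one — can be sketched as follows. The callbacks here are trivial stand-ins, not the DPDK Rx callbacks; only the selection logic mirrors the patch's get_monitor_callback().

```c
#include <stdbool.h>

/* Illustrative stand-ins for the two Rx callbacks added by the patch. */
typedef int (*rx_callback_fn)(void);

static int clb_multiwait(void) { return 2; } /* monitors all queues at once */
static int clb_umwait(void)    { return 1; } /* monitors a single queue */

/* Mirrors get_monitor_callback(): always prefer the multi-monitor path
 * when the platform reports support for it. */
static rx_callback_fn
get_monitor_callback(bool multi_supported)
{
	return multi_supported ? clb_multiwait : clb_umwait;
}
```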
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0e66878892..e387d7811e 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,17 +221,22 @@ power saving whenever empty poll count reaches a certain number.
The "monitor" mode is only supported in the following configurations and scenarios:
* On Linux* x86_64, `rte_power_monitor()` requires WAITPKG instruction set being
- supported by the CPU. Please refer to your platform documentation for further
- information.
+ supported by the CPU, while `rte_power_monitor_multi()` requires both the
+ WAITPKG and RTM instruction sets to be supported by the CPU. The RTM
+ instruction set may also require booting Linux with the `tsx=on` kernel
+ command line parameter. Please refer to your platform documentation for
+ further information.
* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor_multi()`` function is supported by the platform, then
+ monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
``rte_power_monitor()`` is supported by the platform, then monitoring will be
limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
monitored from a different lcore).
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
- ``rte_power_monitor()`` function is not supported, then monitor mode will not
- be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+ two monitoring functions are supported, then monitor mode will not be supported.
* Not all Ethernet drivers support monitoring, even if the underlying
platform may support the necessary CPU instructions. Please refer to
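The three-way outcome the documentation hunk above describes can be sketched as a small decision tree. The two booleans stand in for the fields reported by `rte_cpu_get_intrinsics_support()`; the enum names are illustrative, not DPDK symbols.

```c
#include <stdbool.h>

enum monitor_support {
	MONITOR_UNSUPPORTED,   /* neither intrinsic available: no monitor mode */
	MONITOR_SINGLE_QUEUE,  /* rte_power_monitor() only: 1 queue per lcore */
	MONITOR_MULTI_QUEUE    /* rte_power_monitor_multi(): many queues per lcore */
};

/* Decision tree from the documentation above; the flags stand in for the
 * power_monitor / power_monitor_multi intrinsics-support fields. */
static enum monitor_support
classify(bool power_monitor, bool power_monitor_multi)
{
	if (power_monitor_multi)
		return MONITOR_MULTI_QUEUE;
	if (power_monitor)
		return MONITOR_SINGLE_QUEUE;
	return MONITOR_UNSUPPORTED;
}
```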
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 30772791af..2586204b93 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -126,6 +126,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
return found;
}
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+ struct rte_power_monitor_cond *pmc, size_t len)
+{
+ const struct queue_list_entry *qle;
+ size_t i = 0;
+ int ret;
+
+ TAILQ_FOREACH(qle, &cfg->head, next) {
+ const union queue *q = &qle->queue;
+ struct rte_power_monitor_cond *cur;
+
+ /* attempted out of bounds access */
+ if (i >= len) {
+ RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+ return -1;
+ }
+
+ cur = &pmc[i++];
+ ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
static void
calc_tsc(void)
{
@@ -215,6 +241,46 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
return true;
}
+static uint16_t
+clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
+{
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
+
+ lcore_conf = &lcore_cfgs[lcore];
+
+ /* early exit */
+ if (likely(!empty))
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ struct rte_power_monitor_cond pmc[lcore_conf->n_queues];
+ int ret;
+
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* gather all monitoring conditions */
+ ret = get_monitor_addresses(lcore_conf, pmc,
+ lcore_conf->n_queues);
+ if (ret < 0)
+ return nb_rx;
+
+ rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
+ }
+
+ return nb_rx;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
@@ -366,14 +432,19 @@ static int
check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
{
struct rte_power_monitor_cond dummy;
+ bool multimonitor_supported;
/* check if rte_power_monitor is supported */
if (!global_data.intrinsics_support.power_monitor) {
RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
return -ENOTSUP;
}
+ /* check if multi-monitor is supported */
+ multimonitor_supported =
+ global_data.intrinsics_support.power_monitor_multi;
- if (cfg->n_queues > 0) {
+ /* if we're adding a new queue, do we support multiple queues? */
+ if (cfg->n_queues > 0 && !multimonitor_supported) {
RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
return -ENOTSUP;
}
@@ -389,6 +460,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
return 0;
}
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+ return global_data.intrinsics_support.power_monitor_multi ?
+ clb_multiwait : clb_umwait;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -453,7 +531,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (ret < 0)
goto end;
- clb = clb_umwait;
+ clb = get_monitor_callback();
break;
case RTE_POWER_MGMT_TYPE_SCALE:
/* check if we can add a new queue */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
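The bounds-checked gather loop in get_monitor_addresses() above — the spot where the v4 notes mention fixing a possible out-of-bounds access and a missed index increment — boils down to the following pattern. This is a self-contained sketch with illustrative types, not the DPDK code, which walks a tailq and calls rte_eth_get_monitor_addr() per queue.

```c
#include <stddef.h>

/* Illustrative stand-in for struct rte_power_monitor_cond. */
struct monitor_cond {
	int portid, qid;
};

/* Copy each queue's condition into pmc[], refusing to write past len:
 * the guard runs before the write, and the index increments once per
 * entry, which together are the two v4 fixes noted above. */
static int
gather_conditions(const int (*queues)[2], size_t n_queues,
		  struct monitor_cond *pmc, size_t len)
{
	size_t i;

	for (i = 0; i < n_queues; i++) {
		if (i >= len)
			return -1; /* too many queues being monitored */
		pmc[i].portid = queues[i][0];
		pmc[i].qid = queues[i][1];
	}
	return 0;
}
```

With a destination of length 2, gathering 3 queues fails cleanly instead of overrunning the array, while gathering 2 succeeds.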
* [dpdk-dev] [PATCH v9 8/8] examples/l3fwd-power: support multiq in PMD modes
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
` (6 preceding siblings ...)
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 7/8] power: support monitoring " Anatoly Burakov
@ 2021-07-09 15:53 ` Anatoly Burakov
2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 16:00 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
9 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 15:53 UTC (permalink / raw)
To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev
Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
examples/l3fwd-power/main.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..52f56dc405 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2723,12 +2723,6 @@ main(int argc, char **argv)
printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
fflush(stdout);
- /* PMD power management mode can only do 1 queue per core */
- if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
- rte_exit(EXIT_FAILURE,
- "In PMD power management mode, only one queue per lcore is allowed\n");
- }
-
/* init RX queues */
for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
struct rte_eth_rxconf rxq_conf;
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
` (7 preceding siblings ...)
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 8/8] examples/l3fwd-power: support multiq in PMD modes Anatoly Burakov
@ 2021-07-09 16:00 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
9 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:00 UTC (permalink / raw)
To: dev; +Cc: david.hunt, ciara.loftus, konstantin.ananyev
This patchset introduces several changes related to PMD power management:
- Changed monitoring intrinsics to use callbacks as a comparison function, based
on previous patchset [1] but incorporating feedback [2] - this hopefully will
make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
accompanying infrastructure and example apps changes
v9:
- Added all missing Acks and Tests
- Added a new commit with NIC features
- Addressed minor issues raised in review
v8:
- Fixed checkpatch issue
- Added comment explaining empty poll handling (Konstantin)
v7:
- Fixed various bugs
v6:
- Improved the algorithm for multi-queue sleep
- Fixed segfault and addressed other feedback
v5:
- Removed "power save queue" API and replaced with mechanism suggested by
Konstantin
- Addressed other feedback
v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections
v3:
- Moved some doc updates to NIC features list
v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary
[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274
Anatoly Burakov (8):
eal: use callbacks for power monitoring comparison
net/af_xdp: add power monitor support
doc: add PMD power management NIC feature
eal: add power monitor for multiple events
power: remove thread safety from PMD power API's
power: support callbacks for multiple Rx queues
power: support monitoring multiple Rx queues
examples/l3fwd-power: support multiq in PMD modes
doc/guides/nics/features.rst | 10 +
doc/guides/nics/features/default.ini | 1 +
doc/guides/prog_guide/power_man.rst | 74 +-
doc/guides/rel_notes/release_21_08.rst | 11 +
drivers/event/dlb2/dlb2.c | 17 +-
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +
drivers/net/i40e/i40e_rxtx.c | 20 +-
drivers/net/iavf/iavf_rxtx.c | 20 +-
drivers/net/ice/ice_rxtx.c | 20 +-
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +-
drivers/net/mlx5/mlx5_rx.c | 17 +-
examples/l3fwd-power/main.c | 6 -
lib/eal/arm/rte_power_intrinsics.c | 11 +
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 68 +-
lib/eal/ppc/rte_power_intrinsics.c | 11 +
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 90 ++-
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 663 +++++++++++++-----
lib/power/rte_power_pmd_mgmt.h | 6 +
22 files changed, 847 insertions(+), 262 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v10 0/8] Enhancements for PMD power management
2021-07-09 15:53 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
` (8 preceding siblings ...)
2021-07-09 16:00 ` [dpdk-dev] [PATCH v9 0/8] Enhancements for PMD power management Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 1/8] eal: use callbacks for power monitoring comparison Anatoly Burakov
` (8 more replies)
9 siblings, 9 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev; +Cc: david.hunt, konstantin.ananyev, ciara.loftus
This patchset introduces several changes related to PMD power management:
- Changed monitoring intrinsics to use callbacks as a comparison function, based
on previous patchset [1] but incorporating feedback [2] - this hopefully will
make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
accompanying infrastructure and example apps changes
v10:
- Added missing changes to NIC feature .ini files
v9:
- Added all missing Acks and Tests
- Added a new commit with NIC features
- Addressed minor issues raised in review
v8:
- Fixed checkpatch issue
- Added comment explaining empty poll handling (Konstantin)
v7:
- Fixed various bugs
v6:
- Improved the algorithm for multi-queue sleep
- Fixed segfault and addressed other feedback
v5:
- Removed "power save queue" API and replaced with mechanism suggested by
Konstantin
- Addressed other feedback
v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections
v3:
- Moved some doc updates to NIC features list
v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary
[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274
Anatoly Burakov (8):
eal: use callbacks for power monitoring comparison
net/af_xdp: add power monitor support
doc: add PMD power management NIC feature
eal: add power monitor for multiple events
power: remove thread safety from PMD power API's
power: support callbacks for multiple Rx queues
power: support monitoring multiple Rx queues
examples/l3fwd-power: support multiq in PMD modes
doc/guides/nics/features.rst | 10 +
doc/guides/nics/features/af_xdp.ini | 1 +
doc/guides/nics/features/default.ini | 1 +
doc/guides/nics/features/i40e.ini | 1 +
doc/guides/nics/features/i40e_vf.ini | 1 +
doc/guides/nics/features/iavf.ini | 1 +
doc/guides/nics/features/ice.ini | 1 +
doc/guides/nics/features/ixgbe.ini | 1 +
doc/guides/nics/features/ixgbe_vf.ini | 1 +
doc/guides/nics/features/mlx5.ini | 1 +
doc/guides/prog_guide/power_man.rst | 74 +-
doc/guides/rel_notes/release_21_08.rst | 11 +
drivers/event/dlb2/dlb2.c | 17 +-
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +
drivers/net/i40e/i40e_rxtx.c | 20 +-
drivers/net/iavf/iavf_rxtx.c | 20 +-
drivers/net/ice/ice_rxtx.c | 20 +-
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +-
drivers/net/mlx5/mlx5_rx.c | 17 +-
examples/l3fwd-power/main.c | 6 -
lib/eal/arm/rte_power_intrinsics.c | 11 +
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 68 +-
lib/eal/ppc/rte_power_intrinsics.c | 11 +
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 90 ++-
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 663 +++++++++++++-----
lib/power/rte_power_pmd_mgmt.h | 6 +
30 files changed, 855 insertions(+), 262 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v10 1/8] eal: use callbacks for power monitoring comparison
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 2/8] net/af_xdp: add power monitor support Anatoly Burakov
` (7 subsequent siblings)
8 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Previously, the semantics of power monitor were such that we were
checking current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.
This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.
Existing implementations are adjusted to follow the new semantics.
Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
Acked-by: Timothy McDaniel <timothy.mcdaniel@intel.com>
---
Notes:
v4:
- Return error if callback is set to NULL
- Replace raw number with a macro in monitor condition opaque data
v2:
- Use callback mechanism for more flexibility
- Address feedback from Konstantin
doc/guides/rel_notes/release_21_08.rst | 2 ++
drivers/event/dlb2/dlb2.c | 17 ++++++++--
drivers/net/i40e/i40e_rxtx.c | 20 +++++++----
drivers/net/iavf/iavf_rxtx.c | 20 +++++++----
drivers/net/ice/ice_rxtx.c | 20 +++++++----
drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++----
drivers/net/mlx5/mlx5_rx.c | 17 ++++++++--
.../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++----
lib/eal/x86/rte_power_intrinsics.c | 17 +++++-----
9 files changed, 122 insertions(+), 44 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 476822b47f..912fb13b84 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -144,6 +144,8 @@ API Changes
* eal: ``rte_strscpy`` sets ``rte_errno`` to ``E2BIG`` in case of string
truncation.
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+
ABI Changes
-----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
}
}
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ /* abort if the value matches */
+ return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
static inline int
dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
expected_value = 0;
pmc.addr = monitor_addr;
- pmc.val = expected_value;
- pmc.mask = qe_mask.raw_qe[1];
+ /* store expected value and comparison mask in opaque data */
+ pmc.opaque[CLB_VAL_IDX] = expected_value;
+ pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+ /* set up callback */
+ pmc.fn = dlb2_monitor_callback;
pmc.size = sizeof(uint64_t);
rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index e518409fe5..8489f91f1d 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
#define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
+static int
+i40e_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = i40e_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index f817fbc49b..d61b32fcee 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
}
+static int
+iavf_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.qword1.status_error_len;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
- pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+ /* comparison callback */
+ pmc->fn = iavf_monitor_callback;
/* registers are 64-bit */
pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index 3f6e735984..5d7ab4f047 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
+static int
+ice_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.status_error0;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
- pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+ /* comparison callback */
+ pmc->fn = ice_monitor_callback;
/* register is 16-bit */
pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
};
+static int
+ixgbe_monitor_callback(const uint64_t value,
+ const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+ const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /*
+ * we expect the DD bit to be set to 1 if this descriptor was already
+ * written to.
+ */
+ return (value & m) == m ? -1 : 0;
+}
+
int
ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
/* watch for changes in status bit */
pmc->addr = &rxdp->wb.upper.status_error;
- /*
- * we expect the DD bit to be set to 1 if this descriptor was already
- * written to.
- */
- pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
- pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+ /* comparison callback */
+ pmc->fn = ixgbe_monitor_callback;
/* the registers are 32-bit */
pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..8d47637892 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return rx_queue_count(rxq);
}
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx5_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t m = opaque[CLB_MSK_IDX];
+ const uint64_t v = opaque[CLB_VAL_IDX];
+
+ return (value & m) == v ? -1 : 0;
+}
+
int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
{
struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -rte_errno;
}
pmc->addr = &cqe->op_own;
- pmc->val = !!idx;
- pmc->mask = MLX5_CQE_OWNER_MASK;
+ pmc->opaque[CLB_VAL_IDX] = !!idx;
+ pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+ pmc->fn = mlx5_monitor_callback;
pmc->size = sizeof(uint8_t);
return 0;
}
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
* which are architecture-dependent.
*/
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ * The value read from memory.
+ * @param opaque
+ * Callback-specific data.
+ *
+ * @return
+ * 0 if entering of power optimized state should proceed
+ * -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
struct rte_power_monitor_cond {
volatile void *addr; /**< Address to monitor for changes */
- uint64_t val; /**< If the `mask` is non-zero, location pointed
- * to by `addr` will be read and compared
- * against this value.
- */
- uint64_t mask; /**< 64-bit mask to extract value read from `addr` */
- uint8_t size; /**< Data size (in bytes) that will be used to compare
- * expected value (`val`) with data read from the
+ uint8_t size; /**< Data size (in bytes) that will be read from the
* monitored memory location (`addr`). Can be 1, 2,
* 4, or 8. Supplying any other value will result in
* an error.
*/
+ rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+ * entering power optimized state should
+ * be aborted.
+ */
+ uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+ /**< Callback-specific data */
};
/**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
const unsigned int lcore_id = rte_lcore_id();
struct power_wait_status *s;
+ uint64_t cur_value;
/* prevent user from running this instruction if it's not supported */
if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
if (__check_val_size(pmc->size) < 0)
return -EINVAL;
+ if (pmc->fn == NULL)
+ return -EINVAL;
+
s = &wait_status[lcore_id];
/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
/* now that we've put this address into monitor, we can unlock */
rte_spinlock_unlock(&s->lock);
- /* if we have a comparison mask, we might not need to sleep at all */
- if (pmc->mask) {
- const uint64_t cur_value = __get_umwait_val(
- pmc->addr, pmc->size);
- const uint64_t masked = cur_value & pmc->mask;
+ cur_value = __get_umwait_val(pmc->addr, pmc->size);
- /* if the masked value is already matching, abort */
- if (masked == pmc->val)
- goto end;
- }
+ /* check if callback indicates we should abort */
+ if (pmc->fn(cur_value, pmc->opaque) != 0)
+ goto end;
/* execute UMWAIT */
asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v10 2/8] net/af_xdp: add power monitor support
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 1/8] eal: use callbacks for power monitoring comparison Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 3/8] doc: add PMD power management NIC feature Anatoly Burakov
` (6 subsequent siblings)
8 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev
Implement support for .get_monitor_addr in the AF_XDP driver.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v8:
- Fix checkpatch issue
v2:
- Rewrite using the callback mechanism
drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..989051dd6d 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
#include <rte_malloc.h>
#include <rte_ring.h>
#include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
#include "compat.h"
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+ const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+ const uint64_t v = opaque[CLB_VAL_IDX];
+ const uint64_t m = (uint32_t)~0;
+
+ /* if the value has changed, abort entering power optimized state */
+ return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+ struct pkt_rx_queue *rxq = rx_queue;
+ unsigned int *prod = rxq->rx.producer;
+ const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+ /* watch for changes in producer ring */
+ pmc->addr = (void *)prod;
+
+ /* store current value */
+ pmc->opaque[CLB_VAL_IDX] = cur_val;
+ pmc->fn = eth_monitor_callback;
+
+ /* AF_XDP producer ring index is 32-bit */
+ pmc->size = sizeof(uint32_t);
+
+ return 0;
+}
+
static int
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
{
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.stats_get = eth_stats_get,
.stats_reset = eth_stats_reset,
+ .get_monitor_addr = eth_get_monitor_addr
};
/** parse busy_budget argument */
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v10 3/8] doc: add PMD power management NIC feature
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 1/8] eal: use callbacks for power monitoring comparison Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 2/8] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 4/8] eal: add power monitor for multiple events Anatoly Burakov
` (5 subsequent siblings)
8 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev, Ciara Loftus, Qi Zhang, Ferruh Yigit, Beilei Xing,
Jingjing Wu, Qiming Yang, Haiyue Wang, Matan Azrad,
Shahaf Shuler, Viacheslav Ovsiienko
Cc: david.hunt, konstantin.ananyev, David Marchand
At this point, Ethernet drivers from multiple vendors support the PMD
power management scheme. Add it to the NIC feature table to indicate
that support.
Suggested-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
Notes:
v10:
- Added missing NIC feature support in ini files
doc/guides/nics/features.rst | 10 ++++++++++
doc/guides/nics/features/af_xdp.ini | 1 +
doc/guides/nics/features/default.ini | 1 +
doc/guides/nics/features/i40e.ini | 1 +
doc/guides/nics/features/i40e_vf.ini | 1 +
doc/guides/nics/features/iavf.ini | 1 +
doc/guides/nics/features/ice.ini | 1 +
doc/guides/nics/features/ixgbe.ini | 1 +
doc/guides/nics/features/ixgbe_vf.ini | 1 +
doc/guides/nics/features/mlx5.ini | 1 +
10 files changed, 19 insertions(+)
diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
* **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
* **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
.. _nic_features_other:
Other dev ops not represented by a Feature
diff --git a/doc/guides/nics/features/af_xdp.ini b/doc/guides/nics/features/af_xdp.ini
index 36953c2dec..4e3f638bf5 100644
--- a/doc/guides/nics/features/af_xdp.ini
+++ b/doc/guides/nics/features/af_xdp.ini
@@ -9,3 +9,4 @@ MTU update = Y
Promiscuous mode = Y
Stats per queue = Y
x86-64 = Y
+Power mgmt address monitor = Y
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 3b55e0ccb0..f1e947bd9e 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -76,6 +76,7 @@ x86-64 =
Usage doc =
Design doc =
Perf doc =
+Power mgmt address monitor =
[rte_flow items]
ah =
diff --git a/doc/guides/nics/features/i40e.ini b/doc/guides/nics/features/i40e.ini
index 1f3f5eb3ff..b6765d0e5a 100644
--- a/doc/guides/nics/features/i40e.ini
+++ b/doc/guides/nics/features/i40e.ini
@@ -51,6 +51,7 @@ x86-32 = Y
x86-64 = Y
ARMv8 = Y
Power8 = Y
+Power mgmt address monitor = Y
[rte_flow items]
ah = Y
diff --git a/doc/guides/nics/features/i40e_vf.ini b/doc/guides/nics/features/i40e_vf.ini
index bac1bb4344..d5b163c1c1 100644
--- a/doc/guides/nics/features/i40e_vf.ini
+++ b/doc/guides/nics/features/i40e_vf.ini
@@ -37,3 +37,4 @@ FreeBSD = Y
Linux = Y
x86-32 = Y
x86-64 = Y
+Power mgmt address monitor = Y
diff --git a/doc/guides/nics/features/iavf.ini b/doc/guides/nics/features/iavf.ini
index 43a84a3bda..146b004da2 100644
--- a/doc/guides/nics/features/iavf.ini
+++ b/doc/guides/nics/features/iavf.ini
@@ -33,6 +33,7 @@ FreeBSD = Y
Linux = Y
x86-32 = Y
x86-64 = Y
+Power mgmt address monitor = Y
[rte_flow items]
ah = Y
diff --git a/doc/guides/nics/features/ice.ini b/doc/guides/nics/features/ice.ini
index 1b9228c678..fbc81c654d 100644
--- a/doc/guides/nics/features/ice.ini
+++ b/doc/guides/nics/features/ice.ini
@@ -42,6 +42,7 @@ Linux = Y
Windows = Y
x86-32 = Y
x86-64 = Y
+Power mgmt address monitor = Y
[rte_flow items]
ah = Y
diff --git a/doc/guides/nics/features/ixgbe.ini b/doc/guides/nics/features/ixgbe.ini
index 93a9cc18ab..92228fe194 100644
--- a/doc/guides/nics/features/ixgbe.ini
+++ b/doc/guides/nics/features/ixgbe.ini
@@ -54,6 +54,7 @@ Linux = Y
ARMv8 = Y
x86-32 = Y
x86-64 = Y
+Power mgmt address monitor = Y
[rte_flow items]
eth = Y
diff --git a/doc/guides/nics/features/ixgbe_vf.ini b/doc/guides/nics/features/ixgbe_vf.ini
index 7161e61f9a..ea8342f2c9 100644
--- a/doc/guides/nics/features/ixgbe_vf.ini
+++ b/doc/guides/nics/features/ixgbe_vf.ini
@@ -38,3 +38,4 @@ Linux = Y
ARMv8 = Y
x86-32 = Y
x86-64 = Y
+Power mgmt address monitor = Y
diff --git a/doc/guides/nics/features/mlx5.ini b/doc/guides/nics/features/mlx5.ini
index 3b82ce41fd..2c7d9f6e8c 100644
--- a/doc/guides/nics/features/mlx5.ini
+++ b/doc/guides/nics/features/mlx5.ini
@@ -51,6 +51,7 @@ Power8 = Y
x86-32 = Y
x86-64 = Y
Usage doc = Y
+Power mgmt address monitor = Y
[rte_flow items]
conntrack = Y
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v10 4/8] eal: add power monitor for multiple events
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
` (2 preceding siblings ...)
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 3/8] doc: add PMD power management NIC feature Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 5/8] power: remove thread safety from PMD power API's Anatoly Burakov
` (4 subsequent siblings)
8 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt, ciara.loftus
Use the RTM and WAITPKG instructions to perform a wait-for-writes
similar to what UMWAIT does, but without the limitation of having to
listen for just one event. This works because the optimized power state
entered by the TPAUSE instruction is exited on RTM transaction abort, so
if we add the addresses we're interested in to the transaction read-set,
any write to those addresses will abort the transaction and wake us up.
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v4:
- Fixed bugs in accessing the monitor condition
- Abort on any monitor condition not having a defined callback
v2:
- Adapt to callback mechanism
lib/eal/arm/rte_power_intrinsics.c | 11 +++
lib/eal/include/generic/rte_cpuflags.h | 2 +
.../include/generic/rte_power_intrinsics.h | 35 +++++++++
lib/eal/ppc/rte_power_intrinsics.c | 11 +++
lib/eal/version.map | 3 +
lib/eal/x86/rte_cpuflags.c | 2 +
lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++
7 files changed, 137 insertions(+)
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
/**< indicates support for rte_power_monitor function */
uint32_t power_pause : 1;
/**< indicates support for rte_power_pause function */
+ uint32_t power_monitor_multi : 1;
+ /**< indicates support for rte_power_monitor_multi function */
};
/**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
__rte_experimental
int rte_power_pause(const uint64_t tsc_timestamp);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each monitoring condition carries a comparison callback
+ * (`fn`) and callback-specific opaque data. The value read from each
+ * monitored address is passed to its callback, and if any callback returns
+ * non-zero, the entering of optimized power state may be aborted.
+ *
+ * @warning It is responsibility of the user to check if this function is
+ * supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ * Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ * An array of monitoring condition structures.
+ * @param num
+ * Length of the `pmc` array.
+ * @param tsc_timestamp
+ * Maximum TSC timestamp to wait for. Note that the wait behavior is
+ * architecture-dependent.
+ *
+ * @return
+ * 0 on success
+ * -EINVAL on invalid parameters
+ * -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp);
+
#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return -ENOTSUP;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ RTE_SET_USED(pmc);
+ RTE_SET_USED(num);
+ RTE_SET_USED(tsc_timestamp);
+
+ return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 2df65c6903..887012d02a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
rte_version_release; # WINDOWS_NO_EXPORT
rte_version_suffix; # WINDOWS_NO_EXPORT
rte_version_year; # WINDOWS_NO_EXPORT
+
+ # added in 21.08
+ rte_power_monitor_multi; # WINDOWS_NO_EXPORT
};
INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
intrinsics->power_monitor = 1;
intrinsics->power_pause = 1;
+ if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+ intrinsics->power_monitor_multi = 1;
}
}
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
#include <rte_common.h>
#include <rte_lcore.h>
+#include <rte_rtm.h>
#include <rte_spinlock.h>
#include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
}
static bool wait_supported;
+static bool wait_multi_supported;
static inline uint64_t
__get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
if (i.power_monitor && i.power_pause)
wait_supported = 1;
+ if (i.power_monitor_multi)
+ wait_multi_supported = 1;
}
int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
* In this case, since we've already woken up, the "wakeup" was
* unneeded, and since T1 is still waiting on T2 releasing the lock, the
* wakeup address is still valid so it's perfectly safe to write it.
+ *
+ * For the multi-monitor case, the act of locking will in itself trigger
+ * the wakeup, so no additional writes are necessary.
*/
rte_spinlock_lock(&s->lock);
if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
return 0;
}
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+ const uint32_t num, const uint64_t tsc_timestamp)
+{
+ const unsigned int lcore_id = rte_lcore_id();
+ struct power_wait_status *s = &wait_status[lcore_id];
+ uint32_t i, rc;
+
+ /* check if supported */
+ if (!wait_multi_supported)
+ return -ENOTSUP;
+
+ if (pmc == NULL || num == 0)
+ return -EINVAL;
+
+ /* we are already inside transaction region, return */
+ if (rte_xtest() != 0)
+ return 0;
+
+ /* start new transaction region */
+ rc = rte_xbegin();
+
+ /* transaction abort, possible write to one of wait addresses */
+ if (rc != RTE_XBEGIN_STARTED)
+ return 0;
+
+ /*
+ * the mere act of reading the lock status here adds the lock to
+ * the read set. This means that when we trigger a wakeup from another
+ * thread, even if we don't have a defined wakeup address and thus don't
+ * actually cause any writes, the act of locking our lock will itself
+ * trigger the wakeup and abort the transaction.
+ */
+ rte_spinlock_is_locked(&s->lock);
+
+ /*
+ * add all addresses to wait on into transaction read-set and check if
+ * any of wakeup conditions are already met.
+ */
+ rc = 0;
+ for (i = 0; i < num; i++) {
+ const struct rte_power_monitor_cond *c = &pmc[i];
+
+ /* cannot be NULL */
+ if (c->fn == NULL) {
+ rc = -EINVAL;
+ break;
+ }
+
+ const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+ /* abort if callback indicates that we need to stop */
+ if (c->fn(val, c->opaque) != 0)
+ break;
+ }
+
+ /* none of the conditions were met, sleep until timeout */
+ if (i == num)
+ rte_power_pause(tsc_timestamp);
+
+ /* end transaction region */
+ rte_xend();
+
+ return rc;
+}
--
2.25.1
^ permalink raw reply [flat|nested] 165+ messages in thread
* [dpdk-dev] [PATCH v10 5/8] power: remove thread safety from PMD power API's
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
` (3 preceding siblings ...)
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 4/8] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 6/8] power: support callbacks for multiple Rx queues Anatoly Burakov
` (3 subsequent siblings)
8 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.
We could have used something like RCU for this use case, but absent a
pressing need for thread safety we'll take the easy way and simply
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v2:
- Add check for stopped queue
- Clarified doc message
- Added release notes
doc/guides/rel_notes/release_21_08.rst | 4 +
lib/power/meson.build | 3 +
lib/power/rte_power_pmd_mgmt.c | 133 ++++++++++---------------
lib/power/rte_power_pmd_mgmt.h | 6 ++
4 files changed, 66 insertions(+), 80 deletions(-)
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 912fb13b84..b9a3caabf0 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -146,6 +146,10 @@ API Changes
* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
+* rte_power: The experimental PMD power management API is no longer considered
+ to be thread safe; all Rx queues affected by the API will now need to be
+ stopped before making any changes to the power management scheme.
+
ABI Changes
-----------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index 36e5a65874..bf937acde4 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -22,4 +22,7 @@ headers = files(
'rte_power_pmd_mgmt.h',
'rte_power_guest_channel.h',
)
+if cc.has_argument('-Wno-cast-qual')
+ cflags += '-Wno-cast-qual'
+endif
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
/**< Callback mode for this queue */
const struct rte_eth_rxtx_callback *cur_cb;
/**< Callback instance */
- volatile bool umwait_in_progress;
- /**< are we currently sleeping? */
uint64_t empty_poll_stats;
/**< Number of empty polls */
} __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
struct rte_power_monitor_cond pmc;
uint16_t ret;
- /*
- * we might get a cancellation request while being
- * inside the callback, in which case the wakeup
- * wouldn't work because it would've arrived too early.
- *
- * to get around this, we notify the other thread that
- * we're sleeping, so that it can spin until we're done.
- * unsolicited wakeups are perfectly safe.
- */
- q_conf->umwait_in_progress = true;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- /* check if we need to cancel sleep */
- if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
- /* use monitoring condition to sleep */
- ret = rte_eth_get_monitor_addr(port_id, qidx,
- &pmc);
- if (ret == 0)
- rte_power_monitor(&pmc, UINT64_MAX);
- }
- q_conf->umwait_in_progress = false;
-
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+ /* use monitoring condition to sleep */
+ ret = rte_eth_get_monitor_addr(port_id, qidx,
+ &pmc);
+ if (ret == 0)
+ rte_power_monitor(&pmc, UINT64_MAX);
}
} else
q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
return nb_rx;
}
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+ struct rte_eth_rxq_info qinfo;
+
+ if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+ return -1;
+
+ return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
struct pmd_queue_cfg *queue_cfg;
struct rte_eth_dev_info info;
+ rte_rx_callback_fn clb;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
queue_cfg = &port_cfg[port_id][queue_id];
if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->umwait_in_progress = false;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* ensure we update our state before callback starts */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_umwait, NULL);
+ clb = clb_umwait;
break;
}
case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -ENOTSUP;
goto end;
}
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
- queue_id, clb_scale_freq, NULL);
+ clb = clb_scale_freq;
break;
}
case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (global_data.tsc_per_us == 0)
calc_tsc();
- /* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
- /* this is not necessary here, but do it anyway */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb_pause, NULL);
+ clb = clb_pause;
break;
+ default:
+ RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+ ret = -EINVAL;
+ goto end;
}
+
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, NULL);
+
ret = 0;
end:
return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
struct pmd_queue_cfg *queue_cfg;
+ int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
return -EINVAL;
+ /* check if the queue is stopped */
+ ret = queue_stopped(port_id, queue_id);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
/* no need to check queue id as wrong queue id would not be enabled */
queue_cfg = &port_cfg[port_id][queue_id];
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
/* stop any callbacks from progressing */
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
- /* ensure we update our state before continuing */
- rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
switch (queue_cfg->cb_mode) {
- case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- bool exit = false;
- do {
- /*
- * we may request cancellation while the other thread
- * has just entered the callback but hasn't started
- * sleeping yet, so keep waking it up until we know it's
- * done sleeping.
- */
- if (queue_cfg->umwait_in_progress)
- rte_power_monitor_wakeup(lcore_id);
- else
- exit = true;
- } while (!exit);
- }
- /* fall-through */
+ case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
rte_eth_remove_rx_callback(port_id, queue_id,
queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
break;
}
/*
- * we don't free the RX callback here because it is unsafe to do so
- * unless we know for a fact that all data plane threads have stopped.
+ * the API doc mandates that the user stops all processing on affected
+ * ports before calling any of these API's, so we can assume that the
+ * callbacks can be freed. we're intentionally casting away const-ness.
*/
- queue_cfg->cur_cb = NULL;
+ rte_free((void *)queue_cfg->cur_cb);
return 0;
}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue will be polled from.
* @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
*
* @note This function is not thread-safe.
*
+ * @warning This function must be called when all affected Ethernet queues are
+ * stopped and no Rx/Tx is in progress!
+ *
* @param lcore_id
* The lcore the Rx queue is polled from.
* @param port_id
--
2.25.1
* [dpdk-dev] [PATCH v10 6/8] power: support callbacks for multiple Rx queues
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
` (4 preceding siblings ...)
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 5/8] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 7/8] power: support monitoring " Anatoly Burakov
` (2 subsequent siblings)
8 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, there is a hard limitation in the PMD power management
support that allows it to manage only a single queue per lcore. This is
not ideal, as most DPDK use cases will poll multiple queues per core.
The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:
- Replace per-queue structures with per-lcore ones, so that any device
polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
added to the list of queues to poll, so that the callback is aware of
other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
shared between all queues polled on a particular lcore, and power saving
is only activated when all queues in the list have been polled and
determined to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
is incapable of monitoring more than one address.
Also, while we're at it, update and improve the docs.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v8:
- Added a comment explaining that we want to sleep on each empty poll after
threshold has been reached
v7:
- Fix bug where initial sleep target was always set to zero
- Fix logic in handling of n_queues_ready_to_sleep counter
- Update documentation on hardware requirements
v6:
- Track each individual queue sleep status (Konstantin)
- Fix segfault (Dave)
v5:
- Remove the "power save queue" API and replace it with mechanism suggested by
Konstantin
v3:
- Move the list of supported NICs to NIC feature table
v2:
- Use a TAILQ for queues instead of a static array
- Address feedback from Konstantin
- Add additional checks for stopped queues
doc/guides/prog_guide/power_man.rst | 69 ++--
doc/guides/rel_notes/release_21_08.rst | 5 +
lib/power/rte_power_pmd_mgmt.c | 460 +++++++++++++++++++------
3 files changed, 398 insertions(+), 136 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..0e66878892 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,45 @@ Ethernet PMD Power Management API
Abstract
~~~~~~~~
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
- This power saving scheme will put the CPU into optimized power state
- and use the ``rte_power_monitor()`` function
- to monitor the Ethernet PMD RX descriptor address,
- and wake the CPU up whenever there's new traffic.
-
-Pause
- This power saving scheme will avoid busy polling
- by either entering power-optimized sleep state
- with ``rte_power_pause()`` function,
- or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
- This power saving scheme will use ``librte_power`` library
- functionality to scale the core frequency up/down
- depending on traffic volume.
-
-.. note::
-
- Currently, this power management API is limited to mandatory mapping
- of 1 queue to 1 core (multiple queues are supported,
- but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+ This power saving scheme will put the CPU into optimized power state and
+ monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+ there's new traffic. Support for this scheme may not be available on all
+ platforms, and further limitations may apply (see below).
+
+* Pause
+ This power saving scheme will avoid busy polling by either entering
+ power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+ not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+ This power saving scheme will use ``librte_power`` library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* On Linux* x86_64, `rte_power_monitor()` requires WAITPKG instruction set being
+ supported by the CPU. Please refer to your platform documentation for further
+ information.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+ limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
+ monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+ ``rte_power_monitor()`` function is not supported, then monitor mode will not
+ be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+ platform may support the necessary CPU instructions. Please refer to
+ :doc:`../nics/overview` for more information.
+
API Overview for Ethernet PMD Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +253,5 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index b9a3caabf0..ca28ebe461 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -112,6 +112,11 @@ New Features
Added support for cppc_cpufreq driver which works on most arm64 platforms.
+* **Added multi-queue support to Ethernet PMD Power Management**
+
+ The experimental PMD power management API now supports managing
+ multiple Ethernet Rx queues per lcore.
+
Removed Items
-------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..30772791af 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,98 @@ enum pmd_mgmt_state {
PMD_MGMT_ENABLED
};
-struct pmd_queue_cfg {
+union queue {
+ uint32_t val;
+ struct {
+ uint16_t portid;
+ uint16_t qid;
+ };
+};
+
+struct queue_list_entry {
+ TAILQ_ENTRY(queue_list_entry) next;
+ union queue queue;
+ uint64_t n_empty_polls;
+ uint64_t n_sleeps;
+ const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+ TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+ /**< List of queues associated with this lcore */
+ size_t n_queues;
+ /**< How many queues are in the list? */
volatile enum pmd_mgmt_state pwr_mgmt_state;
/**< State of power management for this queue */
enum rte_power_pmd_mgmt_type cb_mode;
/**< Callback mode for this queue */
- const struct rte_eth_rxtx_callback *cur_cb;
- /**< Callback instance */
- uint64_t empty_poll_stats;
- /**< Number of empty polls */
+ uint64_t n_queues_ready_to_sleep;
+ /**< Number of queues ready to enter power optimized state */
+ uint64_t sleep_target;
+ /**< Prevent a queue from triggering sleep multiple times */
} __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+ return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+ dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *cur;
+
+ TAILQ_FOREACH(cur, &cfg->head, next) {
+ if (queue_equal(&cur->queue, q))
+ return cur;
+ }
+ return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *qle;
+
+ /* is it already in the list? */
+ if (queue_list_find(cfg, q) != NULL)
+ return -EEXIST;
+
+ qle = malloc(sizeof(*qle));
+ if (qle == NULL)
+ return -ENOMEM;
+ memset(qle, 0, sizeof(*qle));
+
+ queue_copy(&qle->queue, q);
+ TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+ cfg->n_queues++;
+
+ return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+ struct queue_list_entry *found;
+
+ found = queue_list_find(cfg, q);
+ if (found == NULL)
+ return NULL;
+
+ TAILQ_REMOVE(&cfg->head, found, next);
+ cfg->n_queues--;
+
+ /* freeing is responsibility of the caller */
+ return found;
+}
static void
calc_tsc(void)
@@ -74,21 +154,79 @@ calc_tsc(void)
}
}
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ const bool is_ready_to_sleep = qcfg->n_sleeps == cfg->sleep_target;
+
+ /* reset empty poll counter for this queue */
+ qcfg->n_empty_polls = 0;
+ /* reset the queue sleep counter as well */
+ qcfg->n_sleeps = 0;
+ /* remove the queue from list of queues ready to sleep */
+ if (is_ready_to_sleep)
+ cfg->n_queues_ready_to_sleep--;
+ /*
+ * no need to change the lcore sleep target counter because this lcore will
+ * reach the n_sleeps anyway, and the other cores are already counted so
+ * there's no need to do anything else.
+ */
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+ /* this function is called - that means we have an empty poll */
+ qcfg->n_empty_polls++;
+
+ /* if we haven't reached threshold for empty polls, we can't sleep */
+ if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+ return false;
+
+ /*
+ * we've reached a point where we are able to sleep, but we still need
+ * to check if this queue has already been marked for sleeping.
+ */
+ if (qcfg->n_sleeps == cfg->sleep_target)
+ return true;
+
+ /* mark this queue as ready for sleep */
+ qcfg->n_sleeps = cfg->sleep_target;
+ cfg->n_queues_ready_to_sleep++;
+
+ return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+ /* are all queues ready to sleep? */
+ if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+ return false;
+
+ /* we've reached an iteration where we can sleep, reset sleep counter */
+ cfg->n_queues_ready_to_sleep = 0;
+ cfg->sleep_target++;
+ /*
+ * we do not reset any individual queue empty poll counters, because
+ * we want to keep sleeping on every poll until we actually get traffic.
+ */
+
+ return true;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+ uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
{
+ struct queue_list_entry *queue_conf = arg;
- struct pmd_queue_cfg *q_conf;
-
- q_conf = &port_cfg[port_id][qidx];
-
+ /* this callback can't do more than one queue, omit multiqueue logic */
if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+ queue_conf->n_empty_polls++;
+ if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
struct rte_power_monitor_cond pmc;
- uint16_t ret;
+ int ret;
/* use monitoring condition to sleep */
ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +235,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
rte_power_monitor(&pmc, UINT64_MAX);
}
} else
- q_conf->empty_poll_stats = 0;
+ queue_conf->n_empty_polls = 0;
return nb_rx;
}
static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
- uint16_t nb_rx, uint16_t max_pkts __rte_unused,
- void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
- q_conf = &port_cfg[port_id][qidx];
+ lcore_conf = &lcore_cfgs[lcore];
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- /* sleep for 1 microsecond */
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
- /* use tpause if we have it */
- if (global_data.intrinsics_support.power_pause) {
- const uint64_t cur = rte_rdtsc();
- const uint64_t wait_tsc =
- cur + global_data.tsc_per_us;
- rte_power_pause(wait_tsc);
- } else {
- uint64_t i;
- for (i = 0; i < global_data.pause_per_us; i++)
- rte_pause();
- }
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* sleep for 1 microsecond, use tpause if we have it */
+ if (global_data.intrinsics_support.power_pause) {
+ const uint64_t cur = rte_rdtsc();
+ const uint64_t wait_tsc =
+ cur + global_data.tsc_per_us;
+ rte_power_pause(wait_tsc);
+ } else {
+ uint64_t i;
+ for (i = 0; i < global_data.pause_per_us; i++)
+ rte_pause();
}
- } else
- q_conf->empty_poll_stats = 0;
+ }
return nb_rx;
}
static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
- uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+ uint16_t max_pkts __rte_unused, void *arg)
{
- struct pmd_queue_cfg *q_conf;
+ const unsigned int lcore = rte_lcore_id();
+ const bool empty = nb_rx == 0;
+ struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+ struct queue_list_entry *queue_conf = arg;
- q_conf = &port_cfg[port_id][qidx];
+ if (likely(!empty)) {
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
- if (unlikely(nb_rx == 0)) {
- q_conf->empty_poll_stats++;
- if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
- /* scale down freq */
- rte_power_freq_min(rte_lcore_id());
- } else {
- q_conf->empty_poll_stats = 0;
- /* scale up freq */
+ /* scale up freq immediately */
rte_power_freq_max(rte_lcore_id());
+ } else {
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ rte_power_freq_min(rte_lcore_id());
}
return nb_rx;
@@ -167,11 +322,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
}
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+ const struct queue_list_entry *entry;
+
+ TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+ const union queue *q = &entry->queue;
+ int ret = queue_stopped(q->portid, q->qid);
+ if (ret != 1)
+ return ret;
+ }
+ return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+ enum power_management_env env;
+
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ return -ENOTSUP;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore))
+ return -EINVAL;
+
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+ struct rte_power_monitor_cond dummy;
+
+ /* check if rte_power_monitor is supported */
+ if (!global_data.intrinsics_support.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ return -ENOTSUP;
+ }
+
+ if (cfg->n_queues > 0) {
+ RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+ return -ENOTSUP;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+ &dummy) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+ return -ENOTSUP;
+ }
+
+ /* we're done */
+ return 0;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
struct rte_eth_dev_info info;
rte_rx_callback_fn clb;
int ret;
@@ -202,9 +426,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
goto end;
}
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ ret = ret < 0 ? -EINVAL : -EBUSY;
+ goto end;
+ }
+
+ /* if callback was already enabled, check current callback type */
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+ lcore_cfg->cb_mode != mode) {
ret = -EINVAL;
goto end;
}
@@ -214,53 +448,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
switch (mode) {
case RTE_POWER_MGMT_TYPE_MONITOR:
- {
- struct rte_power_monitor_cond dummy;
-
- /* check if rte_power_monitor is supported */
- if (!global_data.intrinsics_support.power_monitor) {
- RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_monitor(lcore_cfg, &qdata);
+ if (ret < 0)
goto end;
- }
- /* check if the device supports the necessary PMD API */
- if (rte_eth_get_monitor_addr(port_id, queue_id,
- &dummy) == -ENOTSUP) {
- RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_umwait;
break;
- }
case RTE_POWER_MGMT_TYPE_SCALE:
- {
- enum power_management_env env;
- /* only PSTATE and ACPI modes are supported */
- if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
- !rte_power_check_env_supported(
- PM_ENV_PSTATE_CPUFREQ)) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
- ret = -ENOTSUP;
+ /* check if we can add a new queue */
+ ret = check_scale(lcore_id);
+ if (ret < 0)
goto end;
- }
- /* ensure we could initialize the power library */
- if (rte_power_init(lcore_id)) {
- ret = -EINVAL;
- goto end;
- }
- /* ensure we initialized the correct env */
- env = rte_power_get_env();
- if (env != PM_ENV_ACPI_CPUFREQ &&
- env != PM_ENV_PSTATE_CPUFREQ) {
- RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
- ret = -ENOTSUP;
- goto end;
- }
clb = clb_scale_freq;
break;
- }
case RTE_POWER_MGMT_TYPE_PAUSE:
/* figure out various time-to-tsc conversions */
if (global_data.tsc_per_us == 0)
@@ -273,13 +474,27 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
ret = -EINVAL;
goto end;
}
+ /* add this queue to the list */
+ ret = queue_list_add(lcore_cfg, &qdata);
+ if (ret < 0) {
+ RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+ strerror(-ret));
+ goto end;
+ }
+ /* new queue is always added last */
+ queue_cfg = TAILQ_LAST(&lcore_cfg->head, queue_list_head);
+
+ /* when enabling first queue, ensure sleep target is not 0 */
+ if (lcore_cfg->n_queues == 1 && lcore_cfg->sleep_target == 0)
+ lcore_cfg->sleep_target = 1;
/* initialize data before enabling the callback */
- queue_cfg->empty_poll_stats = 0;
- queue_cfg->cb_mode = mode;
- queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
- queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
- clb, NULL);
+ if (lcore_cfg->n_queues == 1) {
+ lcore_cfg->cb_mode = mode;
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+ }
+ queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb, queue_cfg);
ret = 0;
end:
@@ -290,7 +505,9 @@ int
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id, uint16_t queue_id)
{
- struct pmd_queue_cfg *queue_cfg;
+ const union queue qdata = {.portid = port_id, .qid = queue_id};
+ struct pmd_core_cfg *lcore_cfg;
+ struct queue_list_entry *queue_cfg;
int ret;
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +523,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
}
/* no need to check queue id as wrong queue id would not be enabled */
- queue_cfg = &port_cfg[port_id][queue_id];
+ lcore_cfg = &lcore_cfgs[lcore_id];
- if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+ /* check if other queues are stopped as well */
+ ret = cfg_queues_stopped(lcore_cfg);
+ if (ret != 1) {
+ /* error means invalid queue, 0 means queue wasn't stopped */
+ return ret < 0 ? -EINVAL : -EBUSY;
+ }
+
+ if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
return -EINVAL;
- /* stop any callbacks from progressing */
- queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ /*
+ * There is no good/easy way to do this without race conditions, so we
+ * are just going to throw our hands in the air and hope that the user
+ * has read the documentation and has ensured that ports are stopped at
+ * the time we enter the API functions.
+ */
+ queue_cfg = queue_list_take(lcore_cfg, &qdata);
+ if (queue_cfg == NULL)
+ return -ENOENT;
- switch (queue_cfg->cb_mode) {
+ /* if we've removed all queues from the lists, set state to disabled */
+ if (lcore_cfg->n_queues == 0)
+ lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+ switch (lcore_cfg->cb_mode) {
case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
case RTE_POWER_MGMT_TYPE_PAUSE:
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
break;
case RTE_POWER_MGMT_TYPE_SCALE:
rte_power_freq_max(lcore_id);
- rte_eth_remove_rx_callback(port_id, queue_id,
- queue_cfg->cur_cb);
+ rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
rte_power_exit(lcore_id);
break;
}
@@ -332,7 +565,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
* ports before calling any of these API's, so we can assume that the
* callbacks can be freed. we're intentionally casting away const-ness.
*/
- rte_free((void *)queue_cfg->cur_cb);
+ rte_free((void *)queue_cfg->cb);
+ free(queue_cfg);
return 0;
}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+ size_t i;
+
+ /* initialize all tailqs */
+ for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+ struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+ TAILQ_INIT(&cfg->head);
+ }
+}
--
2.25.1
* [dpdk-dev] [PATCH v10 7/8] power: support monitoring multiple Rx queues
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
` (5 preceding siblings ...)
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 6/8] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 8/8] examples/l3fwd-power: support multiq in PMD modes Anatoly Burakov
2021-07-09 19:24 ` [dpdk-dev] [PATCH v10 0/8] Enhancements for PMD power management David Marchand
8 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
Notes:
v6:
- Fix the missed feedback from v5
v4:
- Fix possible out of bounds access
- Added missing index increment
doc/guides/prog_guide/power_man.rst | 15 ++++--
lib/power/rte_power_pmd_mgmt.c | 82 ++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 7 deletions(-)
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0e66878892..e387d7811e 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,17 +221,22 @@ power saving whenever empty poll count reaches a certain number.
The "monitor" mode is only supported in the following configurations and scenarios:
* On Linux* x86_64, `rte_power_monitor()` requires WAITPKG instruction set being
- supported by the CPU. Please refer to your platform documentation for further
- information.
+ supported by the CPU, while `rte_power_monitor_multi()` requires WAITPKG and
+ RTM instruction sets being supported by the CPU. RTM instruction set may also
+ require booting the Linux with `tsx=on` command line parameter. Please refer
+ to your platform documentation for further information.
* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+ ``rte_power_monitor_multi()`` function is supported by the platform, then
+ monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
``rte_power_monitor()`` is supported by the platform, then monitoring will be
limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
monitored from a different lcore).
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
- ``rte_power_monitor()`` function is not supported, then monitor mode will not
- be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+ two monitoring functions are supported, then monitor mode will not be supported.
* Not all Ethernet drivers support monitoring, even if the underlying
platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 30772791af..2586204b93 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -126,6 +126,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
return found;
}
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+ struct rte_power_monitor_cond *pmc, size_t len)
+{
+ const struct queue_list_entry *qle;
+ size_t i = 0;
+ int ret;
+
+ TAILQ_FOREACH(qle, &cfg->head, next) {
+ const union queue *q = &qle->queue;
+ struct rte_power_monitor_cond *cur;
+
+ /* attempted out of bounds access */
+ if (i >= len) {
+ RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+ return -1;
+ }
+
+ cur = &pmc[i++];
+ ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
static void
calc_tsc(void)
{
@@ -215,6 +241,46 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
return true;
}
+static uint16_t
+clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *arg)
+{
+ const unsigned int lcore = rte_lcore_id();
+ struct queue_list_entry *queue_conf = arg;
+ struct pmd_core_cfg *lcore_conf;
+ const bool empty = nb_rx == 0;
+
+ lcore_conf = &lcore_cfgs[lcore];
+
+ if (likely(!empty))
+ /* early exit */
+ queue_reset(lcore_conf, queue_conf);
+ else {
+ struct rte_power_monitor_cond pmc[lcore_conf->n_queues];
+ int ret;
+
+ /* can this queue sleep? */
+ if (!queue_can_sleep(lcore_conf, queue_conf))
+ return nb_rx;
+
+ /* can this lcore sleep? */
+ if (!lcore_can_sleep(lcore_conf))
+ return nb_rx;
+
+ /* gather all monitoring conditions */
+ ret = get_monitor_addresses(lcore_conf, pmc,
+ lcore_conf->n_queues);
+ if (ret < 0)
+ return nb_rx;
+
+ rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
+ }
+
+ return nb_rx;
+}
+
static uint16_t
clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
@@ -366,14 +432,19 @@ static int
check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
{
struct rte_power_monitor_cond dummy;
+ bool multimonitor_supported;
/* check if rte_power_monitor is supported */
if (!global_data.intrinsics_support.power_monitor) {
RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
return -ENOTSUP;
}
+ /* check if multi-monitor is supported */
+ multimonitor_supported =
+ global_data.intrinsics_support.power_monitor_multi;
- if (cfg->n_queues > 0) {
+ /* if we're adding a new queue, do we support multiple queues? */
+ if (cfg->n_queues > 0 && !multimonitor_supported) {
RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
return -ENOTSUP;
}
@@ -389,6 +460,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
return 0;
}
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+ return global_data.intrinsics_support.power_monitor_multi ?
+ clb_multiwait : clb_umwait;
+}
+
int
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -453,7 +531,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
if (ret < 0)
goto end;
- clb = clb_umwait;
+ clb = get_monitor_callback();
break;
case RTE_POWER_MGMT_TYPE_SCALE:
/* check if we can add a new queue */
--
2.25.1
* [dpdk-dev] [PATCH v10 8/8] examples/l3fwd-power: support multiq in PMD modes
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
` (6 preceding siblings ...)
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 7/8] power: support monitoring " Anatoly Burakov
@ 2021-07-09 16:08 ` Anatoly Burakov
2021-07-09 19:24 ` [dpdk-dev] [PATCH v10 0/8] Enhancements for PMD power management David Marchand
8 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-09 16:08 UTC (permalink / raw)
To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus
Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: David Hunt <david.hunt@intel.com>
---
examples/l3fwd-power/main.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..52f56dc405 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2723,12 +2723,6 @@ main(int argc, char **argv)
printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
fflush(stdout);
- /* PMD power management mode can only do 1 queue per core */
- if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
- rte_exit(EXIT_FAILURE,
- "In PMD power management mode, only one queue per lcore is allowed\n");
- }
-
/* init RX queues */
for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
struct rte_eth_rxconf rxq_conf;
--
2.25.1
* Re: [dpdk-dev] [PATCH v10 0/8] Enhancements for PMD power management
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 " Anatoly Burakov
` (7 preceding siblings ...)
2021-07-09 16:08 ` [dpdk-dev] [PATCH v10 8/8] examples/l3fwd-power: support multiq in PMD modes Anatoly Burakov
@ 2021-07-09 19:24 ` David Marchand
8 siblings, 0 replies; 165+ messages in thread
From: David Marchand @ 2021-07-09 19:24 UTC (permalink / raw)
To: Anatoly Burakov
Cc: dev, David Hunt, Ananyev, Konstantin, Ciara Loftus, Yigit,
Ferruh, Andrew Rybchenko, Thomas Monjalon
On Fri, Jul 9, 2021 at 6:08 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset introduces several changes related to PMD power management:
>
> - Changed monitoring intrinsics to use callbacks as a comparison function, based
> on previous patchset [1] but incorporating feedback [2] - this hopefully will
> make it possible to add support for .get_monitor_addr in virtio
> - Add a new intrinsic to monitor multiple addresses, based on RTM instruction
> set and the TPAUSE instruction
> - Add support for PMD power management on multiple queues, as well as all
> accompanying infrastructure and example apps changes
>
> v10:
> - Added missing changes to NIC feature .ini files
>
> v9:
> - Added all missing Acks and Tests
> - Added a new commit with NIC features
> - Addressed minor issues raised in review
>
> v8:
> - Fixed checkpatch issue
> - Added comment explaining empty poll handling (Konstantin)
>
> v7:
> - Fixed various bugs
>
> v6:
> - Improved the algorithm for multi-queue sleep
> - Fixed segfault and addressed other feedback
>
> v5:
> - Removed "power save queue" API and replaced with mechanism suggested by
> Konstantin
> - Addressed other feedback
>
> v4:
> - Replaced raw number with a macro
> - Fixed all the bugs found by Konstantin
> - Some other minor corrections
>
> v3:
> - Moved some doc updates to NIC features list
>
> v2:
> - Changed check inversion to callbacks
> - Addressed feedback from Konstantin
> - Added doc updates where necessary
>
> [1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
> [2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274
>
> Anatoly Burakov (8):
> eal: use callbacks for power monitoring comparison
> net/af_xdp: add power monitor support
> doc: add PMD power management NIC feature
> eal: add power monitor for multiple events
> power: remove thread safety from PMD power API's
> power: support callbacks for multiple Rx queues
> power: support monitoring multiple Rx queues
> examples/l3fwd-power: support multiq in PMD modes
Overall, the series lgtm.
I still have a comment on the opaque pointer passed in callbacks.
This is not blocking, we can still go with followup patches in this release.
It would be great if drivers maintainers could implement this new ops
in their driver or give feedback on what should be enhanced.
Series applied, thanks.
--
David Marchand