* [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices
@ 2020-05-27 17:02 Anatoly Burakov
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
                   ` (7 more replies)
  0 siblings, 8 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, liang.j.ma

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive, along with a set of (hopefully generic) intrinsics that
facilitate that. This is achieved through cooperation with the NIC
driver that will allow us to know the address of the next NIC RX ring
packet descriptor, and wait for writes on it.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
are used in their raw opcode form because there is no widespread
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're
counting empty polls, and if a certain threshold is reached, we get the
address of the next RX ring descriptor from the NIC driver, arm the
monitoring hardware, and enter a power-optimized state. We will then
wake up when either a timeout happens, or a write happens (or generally
whenever the CPU feels like waking up - this is platform-specific), and
proceed as normal. The empty poll counter is reset whenever we actually
get packets, so we only go to sleep when we know nothing is going on.

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows
one to just flip a switch and automagically have power savings.
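The empty-poll mechanism described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patchset: `struct rx_power_state` and `rx_power_should_sleep()` are hypothetical names, and the threshold value is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state; not taken from the patchset. */
struct rx_power_state {
	uint32_t empty_polls; /* consecutive RX bursts that returned nothing */
	uint32_t threshold;   /* empty polls to tolerate before sleeping */
};

/*
 * Update the counter after one RX burst and report whether the caller
 * should now fetch the next descriptor address, arm the monitor and
 * enter the power-optimized state.
 */
static bool
rx_power_should_sleep(struct rx_power_state *st, uint16_t nb_rx)
{
	assert(st != NULL);

	if (nb_rx > 0) {
		/* traffic seen: reset, so we only sleep when idle */
		st->empty_polls = 0;
		return false;
	}
	/* saturate at the threshold so the counter cannot overflow */
	if (st->empty_polls < st->threshold)
		st->empty_polls++;
	return st->empty_polls >= st->threshold;
}
```

A caller would invoke this after every RX burst on the queue and only arm the monitor when it returns true; any burst that returns packets resets the state.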
There are certain limitations in this patchset right now:
- Currently, only 1:1 core to queue mapping is supported, meaning that
  each lcore must at most handle RX on a single queue
- Currently, power management is enabled per-port, not per-queue
- There is potential to greatly increase TX latency if we are buffering
  things, and go to sleep before sending packets
- The API is not perfect and could use some improvement and discussion
- The API doesn't extend to other device types
- The intrinsics are platform-specific, so ethdev has some
  platform-specific code in it
- Support was only implemented for devices using net/ixgbe, net/i40e and
  net/ice drivers

Hopefully this would generate enough feedback to clear a path forward!

Anatoly Burakov (6):
  eal: add power management intrinsics
  ethdev: add simple power management API
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  app/testpmd: add command for power management on a port

 app/test-pmd/cmdline.c                        |  48 +++++++
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  23 +++
 drivers/net/i40e/i40e_rxtx.h                  |   2 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  23 +++
 drivers/net/ice/ice_rxtx.h                    |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  22 +++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 lib/librte_ethdev/rte_ethdev.c                |  39 +++++
 lib/librte_ethdev/rte_ethdev.h                |  70 +++++++++
 lib/librte_ethdev/rte_ethdev_core.h           |  41 +++++-
 lib/librte_ethdev/rte_ethdev_version.map      |   4 +
 20 files changed, 480 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
* [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-27 17:02 [dpdk-dev] [RFC 0/6] Power-optimized RX for Ethernet devices Anatoly Burakov
@ 2020-05-27 17:02 ` Anatoly Burakov
  2020-05-28 11:39   ` Ananyev, Konstantin
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
  2020-05-27 17:02 ` [dpdk-dev] [RFC 2/6] ethdev: add simple power management API Anatoly Burakov
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-05-27 17:02 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, Konstantin Ananyev, david.hunt, liang.j.ma

Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 203 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..8646c4ac16
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file defines APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index bc73ec2c5c..b54a2be4f6 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -59,6 +59,7 @@ generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@ arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..94d6a43763 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
 	RTE_CPUFLAG_EM64T,                  /**< EM64T */
 
+	RTE_CPUFLAG_WAITPKG,                /**< UMONITOR/UMWAIT/TPAUSE */
 	/* (EAX 80000007h) EDX features */
 	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
 
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a0522400fb
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+	rte_mb();
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX, 0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX, 4)
 
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
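Both intrinsics in the patch above pass the 64-bit TSC deadline as two 32-bit halves (the "a" and "d" register constraints) and report the wakeup reason through bit 0 of RFLAGS. The portable sketch below isolates just that arithmetic so it can be checked without WAITPKG hardware. `struct tsc_deadline`, `tsc_deadline_split()` and `wakeup_was_timeout()` are hypothetical names used for illustration only, and the meaning of the flag follows the patch's own documentation rather than an independent reading of the instruction reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two halves handed to UMWAIT/TPAUSE. */
struct tsc_deadline {
	uint32_t lo; /* low 32 bits, the "a" (EAX) operand in the patch */
	uint32_t hi; /* high 32 bits, the "d" (EDX) operand in the patch */
};

/* Split a 64-bit TSC timestamp the same way the patch does. */
static struct tsc_deadline
tsc_deadline_split(uint64_t tsc_timestamp)
{
	struct tsc_deadline d = {
		.lo = (uint32_t)tsc_timestamp,
		.hi = (uint32_t)(tsc_timestamp >> 32),
	};
	return d;
}

/*
 * Extract the carry flag (bit 0) from a saved RFLAGS value. The patch
 * documents 1 as "wakeup due to timeout" and 0 as any other reason.
 */
static int
wakeup_was_timeout(uint64_t rflags)
{
	return (int)(rflags & 0x1);
}
```

For example, a deadline of `0x0000000100000002` splits into `hi = 1` and `lo = 2`, and only bit 0 of the saved flags influences the return value.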
* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
@ 2020-05-28 11:39   ` Ananyev, Konstantin
  2020-05-28 14:40     ` Burakov, Anatoly
  2020-11-02 11:09   ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 11:39 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

Hi Anatoly,

>
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
>
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.

Recently ARM guys introduced new generic API
for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
Probably would make sense to unite both APIs into something common
and HW transparent.
Konstantin

>
> Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

^ permalink raw reply	[flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 11:39   ` Ananyev, Konstantin
@ 2020-05-28 14:40     ` Burakov, Anatoly
  2020-05-28 14:58       ` Bruce Richardson
  2020-05-28 15:38       ` Ananyev, Konstantin
  0 siblings, 2 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2020-05-28 14:40 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> Hi Anatoly,
>
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
>
> Recently ARM guys introduced new generic API
> for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> Probably would make sense to unite both APIs into something common
> and HW transparent.
> Konstantin

Hi Konstantin,

That's not really similar purpose. This is monitoring a cacheline for
writes, not waiting on a specific value. The "expected" value is there
as basically a hack to get around the race condition due to the fact
that by the time you enter monitoring state, the write you're waiting
for may have already happened.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 421+ messages in thread
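The race described in this reply is the reason the RFC re-reads the monitored location after arming the monitor and before waiting. A minimal portable sketch of that decision is shown below; `monitor_may_sleep()` is a hypothetical helper that only mirrors the mask-and-compare step of `rte_power_monitor()` from the patch, with the actual UMONITOR/UMWAIT instructions left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Called after the monitor has been armed on `p`. Returns true when it is
 * still safe to wait, false when the awaited write has already landed.
 * A zero mask disables the comparison, as in the RFC.
 */
static bool
monitor_may_sleep(const volatile uint64_t *p, uint64_t expected_value,
		uint64_t value_mask)
{
	assert(p != NULL);

	if (value_mask == 0)
		return true;
	/* masked value already equals the expected one: do not sleep */
	return (*p & value_mask) != expected_value;
}
```

The ordering matters: arming first and checking second means a write that slips in between is caught either by this check or by the armed monitor, whereas checking first would leave a window in which the write is missed.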
* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 14:40     ` Burakov, Anatoly
@ 2020-05-28 14:58       ` Bruce Richardson
  2020-05-28 15:38       ` Ananyev, Konstantin
  1 sibling, 0 replies; 421+ messages in thread
From: Bruce Richardson @ 2020-05-28 14:58 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Ananyev, Konstantin, dev, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On Thu, May 28, 2020 at 03:40:18PM +0100, Burakov, Anatoly wrote:
> On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> > Hi Anatoly,
> >
> > >
> > > Add two new power management intrinsics, and provide an implementation
> > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > are implemented as raw byte opcodes because there is not yet widespread
> > > compiler support for these instructions.
> > >
> > > The power management instructions provide an architecture-specific
> > > function to either wait until a specified TSC timestamp is reached, or
> > > optionally wait until either a TSC timestamp is reached or a memory
> > > location is written to. The monitor function also provides an optional
> > > comparison, to avoid sleeping when the expected write has already
> > > happened, and no more writes are expected.
> >
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
>
> Hi Konstantin,
>
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. The "expected" value is there as
> basically a hack to get around the race condition due to the fact that by
> the time you enter monitoring state, the write you're waiting for may have
> already happened.
>

Rather than the "expected" value, is it not more useful for a general API
to check for the existing value? Since we are awaiting writes, we may not
know what new value will be written, but we do know what the value is now,
and can just check that it's not been changed.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 421+ messages in thread
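The alternative suggested in this reply, comparing against the value that is known now rather than the value that is expected later, could look roughly like the sketch below. It is one possible reading of the proposal, not an API from the patchset: `monitor_may_sleep_unchanged()` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Called after the monitor has been armed on `p`. `old_value` is a snapshot
 * taken before arming. Returns true when the masked bits are unchanged, so
 * no write has been observed yet and it is safe to wait.
 */
static bool
monitor_may_sleep_unchanged(const volatile uint64_t *p, uint64_t old_value,
		uint64_t value_mask)
{
	assert(p != NULL);

	/* XOR leaves only the bits that differ; the mask selects the ones we care about */
	return ((*p ^ old_value) & value_mask) == 0;
}
```

With this shape the caller does not need to know what the device will write, only what the location held when it last looked.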
* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 14:40     ` Burakov, Anatoly
  2020-05-28 14:58       ` Bruce Richardson
@ 2020-05-28 15:38       ` Ananyev, Konstantin
  2020-05-29 6:56         ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-05-28 15:38 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

> > Hi Anatoly,
> >
> >>
> >> Add two new power management intrinsics, and provide an implementation
> >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >> are implemented as raw byte opcodes because there is not yet widespread
> >> compiler support for these instructions.
> >>
> >> The power management instructions provide an architecture-specific
> >> function to either wait until a specified TSC timestamp is reached, or
> >> optionally wait until either a TSC timestamp is reached or a memory
> >> location is written to. The monitor function also provides an optional
> >> comparison, to avoid sleeping when the expected write has already
> >> happened, and no more writes are expected.
> >
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
>
> Hi Konstantin,
>
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value.

I understand that.

> The "expected" value is there
> as basically a hack to get around the race condition due to the fact
> that by the time you enter monitoring state, the write you're waiting
> for may have already happened.

AFAIK, current rte_wait_until_equal_* does pretty much the same thing:

LDXR memaddr, $reg          // an address to monitor for
if ($reg != expected_value)
    SEVL                    // arm monitor
    do {
        WFE                 // waits for write to that memory address
        LDXR memaddr, $reg
    } while ($reg != expected_value);

Looks pretty similar to what rte_power_monitor() does,
except you don't have a loop for checking the new value.
Plus rte_power_monitor() provides extra options to the user -
timestamp and power save mode to enter.
Also I don't know what is the granularity of such events on ARM,
is it a cache-line or more/less.
Might be ARM people can comment/correct me here.
Konstantin

^ permalink raw reply	[flat|nested] 421+ messages in thread
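The control flow of the pseudocode in this reply can be written in portable C as a rough sketch. This is not the DPDK implementation of `rte_wait_until_equal_32()`: the function name `sketch_wait_until_equal_32()` and the `cpu_wait_hint()` placeholder are hypothetical, the exclusive load is approximated by a C11 atomic load, and the WFE step is replaced by an empty hint so the loop degenerates into plain polling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the architecture's low-power wait (WFE in the pseudocode). */
static inline void
cpu_wait_hint(void)
{
	/* intentionally empty in this portable sketch */
}

/*
 * Block until *addr equals `expected`. Returns how many times the wait hint
 * was issued, which is 0 when the value already matched on entry.
 */
static uint64_t
sketch_wait_until_equal_32(_Atomic uint32_t *addr, uint32_t expected)
{
	uint64_t waits = 0;

	assert(addr != NULL);

	while (atomic_load_explicit(addr, memory_order_acquire) != expected) {
		cpu_wait_hint(); /* would sleep until the location is written */
		waits++;
	}
	return waits;
}
```

The loop is the part that `rte_power_monitor()` in the RFC does not have: it returns after a single wait and leaves re-checking to the caller.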
* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-28 15:38 ` Ananyev, Konstantin
@ 2020-05-29 6:56   ` Jerin Jacob
  2020-06-02 10:15     ` Ananyev, Konstantin
  2020-06-03 6:22     ` Honnappa Nagarahalli
  0 siblings, 2 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-05-29 6:56 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
> > > Hi Anatoly,
> > >
> > >>
> > >> Add two new power management intrinsics, and provide an implementation
> > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > >> are implemented as raw byte opcodes because there is not yet widespread
> > >> compiler support for these instructions.
> > >>
> > >> The power management instructions provide an architecture-specific
> > >> function to either wait until a specified TSC timestamp is reached, or
> > >> optionally wait until either a TSC timestamp is reached or a memory
> > >> location is written to. The monitor function also provides an optional
> > >> comparison, to avoid sleeping when the expected write has already
> > >> happened, and no more writes are expected.
> > >
> > > Recently ARM guys introduced new generic API
> > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > Probably would make sense to unite both APIs into something common
> > > and HW transparent.
> > > Konstantin
> >
> > Hi Konstantin,
> >
> > That's not really similar purpose. This is monitoring a cacheline for
> > writes, not waiting on a specific value.
>
> I understand that.
>
> > The "expected" value is there
> > as basically a hack to get around the race condition due to the fact
> > that by the time you enter monitoring state, the write you're waiting
> > for may have already happened.
>
> AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
>
> LDXR memaddr, $reg    // an address to monitor for
> if ($reg != expected_value)
>     SEVL              // arm monitor
>     do {
>         WFE           // waits for write to that memory address
>         LDXR memaddr, $reg
>     } while ($reg != expected_value);
>
> Looks pretty similar to what rte_power_monitor() does,
> except you don't have a loop for checking the new value.
> Plus rte_power_monitor() provides extra options to the user -
> timestamp and power save mode to enter.
> Also I don't know what is the granularity of such events on ARM,
> is it a cache-line or more/less.

As I understand it, Granularity is per the cache-line.
ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power
state until the cache line is written.

But I see UMONITOR bit different, Where _without_ other core signaling
to wakeup from wait state,
it can wake on TSC expiry. I think, that's is the main primitive on
this feature. Right?

WFE can also wake based on Timer stream events(kind of TSC in x86
analogy) but it has a configuration
bit that needs to allow for this scheme in userspace(EL0) or not?
defined by EL1(Linux kernel).
I am planning to spend time on this after understanding the value
addition of the feature/usecase[1]

[1]
http://mails.dpdk.org/archives/dev/2020-May/168888.html

> Might be ARM people can comment/correct me here.
> Konstantin
* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-29 6:56 ` Jerin Jacob
@ 2020-06-02 10:15   ` Ananyev, Konstantin
  2020-06-03 6:22   ` Honnappa Nagarahalli
  1 sibling, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2020-06-02 10:15 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma, Liang J, Honnappa.Nagarahalli

>
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an implementation
> > > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > >> are implemented as raw byte opcodes because there is not yet widespread
> > > >> compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an architecture-specific
> > > >> function to either wait until a specified TSC timestamp is reached, or
> > > >> optionally wait until either a TSC timestamp is reached or a memory
> > > >> location is written to. The monitor function also provides an optional
> > > >> comparison, to avoid sleeping when the expected write has already
> > > >> happened, and no more writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API
> > > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline for
> > > writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're waiting
> > > for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg    // an address to monitor for
> > if ($reg != expected_value)
> >     SEVL              // arm monitor
> >     do {
> >         WFE           // waits for write to that memory address
> >         LDXR memaddr, $reg
> >     } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does,
> > except you don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM,
> > is it a cache-line or more/less.
>
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power
> state until the cache line is written.
>
> But I see UMONITOR bit different, Where _without_ other core signaling
> to wakeup from wait state,
> it can wake on TSC expiry. I think, that's is the main primitive on
> this feature. Right?
>
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
> I am planning to spend time on this after understanding the value
> addition of the feature/usecase[1]
>
> [1]
> http://mails.dpdk.org/archives/dev/2020-May/168888.html
>

Ok, if there is a consensus to keep these two APIs disjoint for now -
I wouldn't insist.
Konstantin
* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-05-29 6:56 ` Jerin Jacob
  2020-06-02 10:15   ` Ananyev, Konstantin
@ 2020-06-03 6:22   ` Honnappa Nagarahalli
  2020-06-03 6:31     ` Jerin Jacob
  1 sibling, 1 reply; 421+ messages in thread
From: Honnappa Nagarahalli @ 2020-06-03 6:22 UTC (permalink / raw)
  To: Jerin Jacob, Ananyev, Konstantin
  Cc: Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma, Liang J, nd, nd

<snip>

>
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an
> > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > >> The instructions are implemented as raw byte opcodes because
> > > >> there is not yet widespread compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an
> > > >> architecture-specific function to either wait until a specified
> > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > >> timestamp is reached or a memory location is written to. The
> > > >> monitor function also provides an optional comparison, to avoid
> > > >> sleeping when the expected write has already happened, and no more
> > > >> writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API for similar (as I
> > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline
> > > for writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're
> > > waiting for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg    // an address to monitor for
> > if ($reg != expected_value)
> >     SEVL              // arm monitor
> >     do {
> >         WFE           // waits for write to that memory address
> >         LDXR memaddr, $reg
> >     } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does, except you
> > don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM, is it
> > a cache-line or more/less.
>
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> the cache line is written.

Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.

>
> But I see UMONITOR bit different, Where _without_ other core signaling to
> wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> primitive on this feature. Right?
>
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).

Timer stream events are not per CPU core. They are system wide streams.

> I am planning to spend time on this after understanding the value addition of
> the feature/usecase[1]
>
> [1] http://mails.dpdk.org/archives/dev/2020-May/168888.html
>
> > Might be ARM people can comment/correct me here.
> > Konstantin
* Re: [dpdk-dev] [RFC 1/6] eal: add power management intrinsics
  2020-06-03 6:22 ` Honnappa Nagarahalli
@ 2020-06-03 6:31   ` Jerin Jacob
  0 siblings, 0 replies; 421+ messages in thread
From: Jerin Jacob @ 2020-06-03 6:31 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Ananyev, Konstantin, Burakov, Anatoly, dev, Richardson, Bruce, Hunt, David, Ma, Liang J, nd

On Wed, Jun 3, 2020 at 11:53 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> >
> > On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > >
> > > > > Hi Anatoly,
> > > > >
> > > > >>
> > > > >> Add two new power management intrinsics, and provide an
> > > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > > >> The instructions are implemented as raw byte opcodes because
> > > > >> there is not yet widespread compiler support for these instructions.
> > > > >>
> > > > >> The power management instructions provide an
> > > > >> architecture-specific function to either wait until a specified
> > > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > > >> timestamp is reached or a memory location is written to. The
> > > > >> monitor function also provides an optional comparison, to avoid
> > > > >> sleeping when the expected write has already happened, and no more
> > > > >> writes are expected.
> > > > >
> > > > > Recently ARM guys introduced new generic API for similar (as I
> > > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > > Probably would make sense to unite both APIs into something common
> > > > > and HW transparent.
> > > > > Konstantin
> > > >
> > > > Hi Konstantin,
> > > >
> > > > That's not really similar purpose. This is monitoring a cacheline
> > > > for writes, not waiting on a specific value.
> > >
> > > I understand that.
> > >
> > > > The "expected" value is there
> > > > as basically a hack to get around the race condition due to the fact
> > > > that by the time you enter monitoring state, the write you're
> > > > waiting for may have already happened.
> > >
> > > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> > >
> > > LDXR memaddr, $reg    // an address to monitor for
> > > if ($reg != expected_value)
> > >     SEVL              // arm monitor
> > >     do {
> > >         WFE           // waits for write to that memory address
> > >         LDXR memaddr, $reg
> > >     } while ($reg != expected_value);
> > >
> > > Looks pretty similar to what rte_power_monitor() does, except you
> > > don't have a loop for checking the new value.
> > > Plus rte_power_monitor() provides extra options to the user -
> > > timestamp and power save mode to enter.
> > > Also I don't know what is the granularity of such events on ARM, is it
> > > a cache-line or more/less.
> >
> > As I understand it, Granularity is per the cache-line.
> > ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> > the cache line is written.
> Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.
>
> >
> > But I see UMONITOR bit different, Where _without_ other core signaling to
> > wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> > primitive on this feature. Right?
> >
> > WFE can also wake based on Timer stream events(kind of TSC in x86
> > analogy) but it has a configuration
> > bit that needs to allow for this scheme in userspace(EL0) or not?
> > defined by EL1(Linux kernel).
> Timer stream events are not per CPU core. They are system wide streams.

We may not need per-core support to implement this use case.
I think the kernel is currently configured to generate a WFE wake-up
event every 100us (system-wide).
The do {} while loop can check after each WFE whether the requested
timestamp has passed, but the minimum granularity will be 100us
(i.e. 100us worth of ticks).

>
> > I am planning to spend time on this after understanding the value addition of
> > the feature/usecase[1]
> >
> > [1] http://mails.dpdk.org/archives/dev/2020-May/168888.html
> >
> > > Might be ARM people can comment/correct me here.
> > > Konstantin
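The timestamp check described in the message above can be sketched with a simulated clock. Everything in this block is an assumption made for illustration: the 100us period is the figure guessed at in the message, `sim_clock` and `sim_wait_for_event` replace the real event stream and WFE, and no kernel or DPDK interface is being quoted.

```c
#include <stdint.h>

/* Assumed system-wide event period, taken from the message above. */
#define SKETCH_EVENT_PERIOD_US 100

/* Simulated clock in microseconds (hypothetical test scaffolding). */
struct sim_clock {
	uint64_t now_us;
};

/* Stand-in for WFE: returns at the next event-stream tick. */
static void
sim_wait_for_event(struct sim_clock *clk)
{
	clk->now_us = (clk->now_us / SKETCH_EVENT_PERIOD_US + 1) *
			SKETCH_EVENT_PERIOD_US;
}

/* Wait until *addr differs from old_val or the deadline has passed.
 * Returns the simulated time at which the loop exited. */
static uint64_t
sketch_timed_wait(volatile uint64_t *addr, uint64_t old_val,
		uint64_t deadline_us, struct sim_clock *clk)
{
	do {
		sim_wait_for_event(clk);
	} while (*addr == old_val && clk->now_us < deadline_us);
	return clk->now_us;
}
```

With no write arriving, a deadline of 250us is only noticed at the 300us tick, which is the coarse granularity the message points out.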
* [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt
  2020-05-27 17:02 ` [dpdk-dev] [RFC 1/6] eal: add power management intrinsics Anatoly Burakov
  2020-05-28 11:39   ` Ananyev, Konstantin
@ 2020-11-02 11:09   ` Liang Ma
  2020-11-02 11:10     ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma
                       ` (17 more replies)
  1 sibling, 18 replies; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:09 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive, along with a set of generic intrinsics that facilitate that.
This is achieved through cooperation with the NIC driver, which lets us
learn the address of the wake-up event and wait for writes to it.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
are used in their raw opcode form because there is no widespread
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're
counting empty polls, and if a certain threshold is reached, we get the
address of next RX ring descriptor from the NIC driver, arm the
monitoring hardware, and enter a power-optimized state. We will then
wake up when either a timeout happens, or a write happens (or generally
whenever CPU feels like waking up - this is platform-specific), and
proceed as normal. The empty poll counter is reset whenever we actually
get packets, so we only go to sleep when we know nothing is going on.
The mechanism is generic and can be used for any write-back descriptor.

Why are we putting it into ethdev as opposed to leaving this up to the
application?
Our customers specifically requested a way to do it with minimal
changes to the application code. The current approach allows the
application to just flip a switch and automatically have power savings.

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must at most handle RX on a single queue
- Three policy types are supported: UMWAIT/PAUSE/Frequency_Scale
- Power management is enabled per-queue
- The API doesn't extend to other device types

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  51 +++
 doc/guides/rel_notes/release_20_11.rst        |  17 +
 .../sample_app_ug/l3_forward_power_man.rst    |  14 +
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  26 ++
 drivers/net/i40e/i40e_rxtx.h                  |   2 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   2 +
 examples/l3fwd-power/main.c                   |  46 ++-
 lib/librte_ethdev/rte_ethdev.c                |  23 ++
 lib/librte_ethdev/rte_ethdev.h                |  41 +++
 lib/librte_ethdev/rte_ethdev_driver.h         |  28 ++
 lib/librte_ethdev/version.map                 |   1 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 320 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  92 +++++
 lib/librte_power/version.map                  |   4 +
 21 files changed, 725 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.17.1
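The empty-poll mechanism described in the cover letter above can be sketched as a small counter. This is a minimal sketch under stated assumptions, not the code from the patchset: `empty_poll_state`, `sketch_should_sleep` and the threshold value used by a caller are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue state; the threshold is illustrative only. */
struct empty_poll_state {
	uint32_t empty_polls;
	uint32_t threshold;
};

/* Called after every RX burst with the number of packets received.
 * Returns true when the caller should arm the monitor and sleep. */
static bool
sketch_should_sleep(struct empty_poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		st->empty_polls = 0; /* traffic seen: start counting again */
		return false;
	}
	/* Saturate at the threshold so the counter cannot wrap around. */
	if (st->empty_polls < st->threshold)
		st->empty_polls++;
	return st->empty_polls >= st->threshold;
}
```

Once the threshold is reached, every further empty poll keeps returning true until a non-empty burst resets the counter, which matches the "only go to sleep when we know nothing is going on" behaviour described above.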
* [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API
  2020-11-02 11:09 ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
@ 2020-11-02 11:10   ` Liang Ma
  2020-11-02 12:23     ` Burakov, Anatoly
  2020-11-02 11:10   ` [dpdk-dev] [PATCH v11 2/6] power: add PMD power management API and callback Liang Ma
                     ` (16 subsequent siblings)
  17 siblings, 1 reply; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Anatoly Burakov, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella

Add a simple API to allow getting address of getting notification
information from the PMD, as well as release notes information.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v11:
    - Rework the API Doxygen documentation

    v10:
    - Address minor issue on comments and release notes

    v8:
    - Rename version map file name

    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup

    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check
---
 doc/guides/rel_notes/release_20_11.rst |  4 +++
 lib/librte_ethdev/rte_ethdev.c         | 23 +++++++++++++++
 lib/librte_ethdev/rte_ethdev.h         | 41 ++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_driver.h  | 28 ++++++++++++++++++
 lib/librte_ethdev/version.map          |  1 +
 5 files changed, 97 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 88b9086390..e95e6aa7a5 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -148,6 +148,10 @@ New Features
   Hairpin Tx part flow rules can be inserted explicitly.
   New API is added to get the hairpin peer ports list.
 
+* **ethdev: added 1 new API for PMD power management.**
+
+  * ``rte_eth_get_wake_addr()`` is added to get the wake up address from device.
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index b12bb3854d..4f3115fe8e 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5138,6 +5138,29 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 		       dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode));
 }
 
+int
+rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_wake_addr, -ENOTSUP);
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	return eth_err(port_id,
+		dev->dev_ops->get_wake_addr(dev->data->rx_queues[queue_id],
+			wake_addr, expected, mask, data_sz));
+}
+
 int
 rte_eth_dev_set_mc_addr_list(uint16_t port_id,
 			     struct rte_ether_addr *mc_addr_set,
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index e341a08817..d208fe99ca 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4364,6 +4364,47 @@ __rte_experimental
 int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_burst_mode *mode);
 
+/**
+ * In order to make use of some PMD Power Management schemes, the user might
+ * want to wait (sleep/poll) until new packets arrive. This function
+ * retrieves the necessary information from the PMD to enter that wait/sleep
+ * state. The main parameter provided is the address to monitor while waiting
+ * to wake up. In addition to this wake address, the function also provides
+ * extra information including expected value, selection mask and data size to
+ * monitor. The user is expected to use this information to enter low power
+ * mode using the rte_power_monitor API, and the core will exit low power mode
+ * upon reaching the expected condition:
+ * (((uint64_t)read_mem(wake_addr, data_sz)) & mask) == expected).
+ * @note The low power mode can also exit in other cases, e.g. interrupt.
+ *
+ * @param[in] port_id
+ *  The port identifier of the Ethernet device.
+ * @param[in] queue_id
+ *  The Rx queue on the Ethernet device for which information will be
+ *  retrieved.
+ * @param[out] wake_addr
+ *  The pointer to the address which will be monitored.
+ * @param[out] expected
+ *  The pointer to the expected value to allow wakeup condition.
+ * @param[out] mask
+ *  The pointer to comparison bitmask for the expected value.
+ * @note a mask value of zero should be treated as:
+ *  “no special wakeup values for provided address from the driver”.
+ * @param[out] data_sz
+ *  The pointer to data size for the expected value (in bytes)
+ * @note valid values are 1,2,4,8
+ *
+ * @return
+ *  - 0: Success.
+ *  -ENOTSUP: Operation not supported.
+ *  -EINVAL: Invalid parameters.
+ *  -ENODEV: Invalid port ID.
+ */
+__rte_experimental
+int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id,
+		volatile void **wake_addr, uint64_t *expected, uint64_t *mask,
+		uint8_t *data_sz);
+
 /**
  * Retrieve device registers and register attributes (number of registers and
  * register size)
diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h
index c63b9f7eb7..e7ce1e261d 100644
--- a/lib/librte_ethdev/rte_ethdev_driver.h
+++ b/lib/librte_ethdev/rte_ethdev_driver.h
@@ -752,6 +752,32 @@ typedef int (*eth_hairpin_queue_peer_unbind_t)
 	(struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction);
 /**< @internal Unbind peer queue from the current queue. */
 
+/**
+ * @internal
+ * Get address of memory location whose contents will change whenever there is
+ * new data to be received.
+ *
+ * @param rxq
+ *   Ethdev queue pointer.
+ * @param tail_desc_addr
+ *   The pointer point to where the address will be stored.
+ * @param expected
+ *   The pointer point to value to be expected when descriptor is set.
+ * @param mask
+ *   The pointer point to comparison bitmask for the expected value.
+ * @param data_sz
+ *   Data size for the expected value (can be 1, 2, 4, or 8 bytes)
+ * @return
+ *   Negative errno value on error, 0 on success.
+ *
+ * @retval 0
+ *   Success
+ * @retval -EINVAL
+ *   Invalid parameters
+ */
+typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr,
+		uint64_t *expected, uint64_t *mask, uint8_t *data_sz);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -910,6 +936,8 @@ struct eth_dev_ops {
 	/**< Set up the connection between the pair of hairpin queues. */
 	eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind;
 	/**< Disconnect the hairpin queues of a pair from each other. */
+	eth_get_wake_addr_t get_wake_addr;
+	/**< Get next RX queue ring entry address. */
 };
 
 /**
diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map
index 8ddda2547f..f9ce4e3c8d 100644
--- a/lib/librte_ethdev/version.map
+++ b/lib/librte_ethdev/version.map
@@ -235,6 +235,7 @@ EXPERIMENTAL {
 	rte_eth_fec_get_capability;
 	rte_eth_fec_get;
 	rte_eth_fec_set;
+	rte_eth_get_wake_addr;
 	rte_flow_shared_action_create;
 	rte_flow_shared_action_destroy;
 	rte_flow_shared_action_query;
-- 
2.17.1
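The wake-up condition documented for rte_eth_get_wake_addr() in the patch above, `(read_mem(wake_addr, data_sz) & mask) == expected`, can be illustrated with a small standalone check. This is a sketch under stated assumptions: `sketch_read_mem` and `sketch_wake_condition_met` are hypothetical helpers, not functions from the patch, and treating a zero mask or an invalid size as "condition not met" is one possible reading of the doc comment rather than confirmed driver behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* Read data_sz bytes (1, 2, 4 or 8) from addr, widened to 64 bits.
 * Hypothetical helper mirroring read_mem() from the doc comment;
 * callers must pass a valid size. */
static uint64_t
sketch_read_mem(const volatile void *addr, uint8_t data_sz)
{
	switch (data_sz) {
	case 1:
		return *(const volatile uint8_t *)addr;
	case 2:
		return *(const volatile uint16_t *)addr;
	case 4:
		return *(const volatile uint32_t *)addr;
	default:
		return *(const volatile uint64_t *)addr;
	}
}

/* Returns true when the masked value already equals the expected one,
 * i.e. the write being waited for has happened. */
static bool
sketch_wake_condition_met(const volatile void *addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	if (data_sz != 1 && data_sz != 2 && data_sz != 4 && data_sz != 8)
		return false; /* invalid size: assumed to mean "no match" */
	if (mask == 0)
		return false; /* assumed: no special wakeup value to compare */
	return (sketch_read_mem(addr, data_sz) & mask) == expected;
}
```

A caller would evaluate this before sleeping and again after waking, since the low power mode can also exit for other reasons such as an interrupt.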
* Re: [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma @ 2020-11-02 12:23 ` Burakov, Anatoly 0 siblings, 0 replies; 421+ messages in thread From: Burakov, Anatoly @ 2020-11-02 12:23 UTC (permalink / raw) To: Liang Ma, dev Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan, yongwang, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella On 02-Nov-20 11:10 AM, Liang Ma wrote: > Add a simple API to allow getting address of getting notification > information from the PMD, as well as release notes information. > > Signed-off-by: Liang Ma <liang.j.ma@intel.com> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > --- > > Notes: > v11: > - Rework the API Doxygen documentation > > v10: > - Address minor issue on comments and release notes > > v8: > - Rename version map file name > > v7: > - Fixed race condition (Konstantin) > - Slight rework of the structure of monitor code > - Added missing inline for wakeup > > v6: > - Added wakeup mechanism for UMWAIT > - Removed memory allocation (everything is now allocated statically) > - Fixed various typos and comments > - Check for invalid queue ID > - Moved release notes to this patch > > v5: > - Make error checking more robust > - Prevent initializing scaling if ACPI or PSTATE env wasn't set > - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr > - Add some debug logging > - Replace x86-specific code path to generic path using the intrinsic check > --- <snip> > int > rte_eth_dev_set_mc_addr_list(uint16_t port_id, > struct rte_ether_addr *mc_addr_set, > diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h > index e341a08817..d208fe99ca 100644 > --- 
a/lib/librte_ethdev/rte_ethdev.h > +++ b/lib/librte_ethdev/rte_ethdev.h > @@ -4364,6 +4364,47 @@ __rte_experimental > int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, > struct rte_eth_burst_mode *mode); > > +/** > + * In order to make use of some PMD Power Management schemes, the user might > + * want to wait (sleep/poll) until new packets arrive. This function > + * retrieves the necessary information from the PMD to enter that wait/sleep > + * state. The main parameter provided is the address to monitor while waiting > + * to wake up. In addition to this wake address, the function also provides > + * extra information including expected value, selection mask and data size to > + * monitor. The user is expected to use this information to enter low power > + * mode using the rte_power_monitor API, and the core will exit low power mode > + * upon reaching the expected condition: > + * (((uint64_t)read_mem(wake_addr, data_sz)) & mask) == expected). > + * @note The low power mode can also exit in other cases, e.g. interrupt. Could we maybe have some paragraphs and spacing here, instead of one solid block of text? Suggested formatting: * In order to make use of some PMD Power Management schemes, the user might * want to wait (sleep/poll) until new packets arrive. This function retrieves * the necessary information from the PMD to enter that wait/sleep state. * * The main parameter provided is the address to monitor while waiting to wake * up. In addition to this wake address, the function also provides extra * information including expected value, selection mask and data size to * monitor. * * The user is expected to use this information to enter low power mode using * the rte_power_monitor API, and the core will exit low power mode upon * reaching the expected condition: * * `(((uint64_t)read_mem(wake_addr, data_sz)) & mask) == expected)` * * @note The low power mode can also exit in other cases, e.g. interrupt. 
> + * > + * @param[in] port_id > + * The port identifier of the Ethernet device. > + * @param[in] queue_id > + * The Rx queue on the Ethernet device for which information will be > + * retrieved. > + * @param[out] wake_addr > + * The pointer to the address which will be monitored. > + * @param[out] expected > + * The pointer to the expected value to allow wakeup condition. > + * @param[out] mask > + * The pointer to comparison bitmask for the expected value. > + * @note a mask value of zero should be treated as: > + * “no special wakeup values for provided address from the driver”. Wrong quotes. > + * @param[out] data_sz > + * The pointer to data size for the expected value (in bytes) > + * @note valid values are 1,2,4,8 Shouldn't @note be under @param, i.e. indented to the right at the same level the text is? Also, everywhere else in this file, the indentation of the text is two spaces, not one. So, should be e.g. * @param[out] data_sz * The pointer to data...... * @note valid values are..... > + * > + * @return > + * - 0: Success. > + * -ENOTSUP: Operation not supported. > + * -EINVAL: Invalid parameters. > + * -ENODEV: Invalid port ID. > + */ > +__rte_experimental > +int rte_eth_get_wake_addr(uint16_t port_id, uint16_t queue_id, > + volatile void **wake_addr, uint64_t *expected, uint64_t *mask, > + uint8_t *data_sz); > + > /** > * Retrieve device registers and register attributes (number of registers and > * register size) > diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h > index c63b9f7eb7..e7ce1e261d 100644 > --- a/lib/librte_ethdev/rte_ethdev_driver.h > +++ b/lib/librte_ethdev/rte_ethdev_driver.h > @@ -752,6 +752,32 @@ typedef int (*eth_hairpin_queue_peer_unbind_t) > (struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction); > /**< @internal Unbind peer queue from the current queue. 
*/ > > +/** > + * @internal > + * Get address of memory location whose contents will change whenever there is > + * new data to be received. > + * > + * @param rxq > + * Ethdev queue pointer. > + * @param tail_desc_addr > + * The pointer point to where the address will be stored. > + * @param expected > + * The pointer point to value to be expected when descriptor is set. > + * @param mask > + * The pointer point to comparison bitmask for the expected value. > + * @param data_sz > + * Data size for the expected value (can be 1, 2, 4, or 8 bytes) > + * @return > + * Negative errno value on error, 0 on success. > + * > + * @retval 0 > + * Success > + * @retval -EINVAL > + * Invalid parameters > + */ > +typedef int (*eth_get_wake_addr_t)(void *rxq, volatile void **tail_desc_addr, > + uint64_t *expected, uint64_t *mask, uint8_t *data_sz); > + > /** > * @internal A structure containing the functions exported by an Ethernet driver. > */ > @@ -910,6 +936,8 @@ struct eth_dev_ops { > /**< Set up the connection between the pair of hairpin queues. */ > eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind; > /**< Disconnect the hairpin queues of a pair from each other. */ > + eth_get_wake_addr_t get_wake_addr; > + /**< Get next RX queue ring entry address. */ > }; > > /** > diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map > index 8ddda2547f..f9ce4e3c8d 100644 > --- a/lib/librte_ethdev/version.map > +++ b/lib/librte_ethdev/version.map > @@ -235,6 +235,7 @@ EXPERIMENTAL { > rte_eth_fec_get_capability; > rte_eth_fec_get; > rte_eth_fec_set; > + rte_eth_get_wake_addr; > rte_flow_shared_action_create; > rte_flow_shared_action_destroy; > rte_flow_shared_action_query; > -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
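For readers following the review, the wake-up condition documented in the quoted patch, `((read_mem(wake_addr, data_sz)) & mask) == expected`, can be sketched in plain C. This is only an illustration, not code from the series: `wake_condition_met()` is a hypothetical helper, it assumes `wake_addr` is suitably aligned for `data_sz`, and treating a zero mask as "no value check, just wait for a write" is my reading of the `@note` on the `mask` parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): returns true when the
 * monitored location already satisfies the documented wake-up condition,
 * i.e. when the caller should not go to sleep at all.
 */
static bool
wake_condition_met(const volatile void *wake_addr, uint64_t expected,
		uint64_t mask, uint8_t data_sz)
{
	uint64_t cur;

	/* read exactly data_sz bytes; valid sizes are 1, 2, 4 and 8 */
	switch (data_sz) {
	case 1:
		cur = *(const volatile uint8_t *)wake_addr;
		break;
	case 2:
		cur = *(const volatile uint16_t *)wake_addr;
		break;
	case 4:
		cur = *(const volatile uint32_t *)wake_addr;
		break;
	case 8:
		cur = *(const volatile uint64_t *)wake_addr;
		break;
	default:
		/* invalid size: report "met" so the caller never sleeps */
		return true;
	}

	/* assumption: a zero mask means the driver gave no wakeup value */
	if (mask == 0)
		return false;

	return (cur & mask) == expected;
}
```

A caller would evaluate this right before arming the monitor, for example `if (!wake_condition_met(addr, expected, mask, sz)) { /* enter low power state */ }`, so that a descriptor written between the last poll and the sleep is not missed.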
* [dpdk-dev] [PATCH v11 2/6] power: add PMD power management API and callback 2020-11-02 11:09 ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma @ 2020-11-02 11:10 ` Liang Ma 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 3/6] net/ixgbe: implement power management API Liang Ma ` (15 subsequent siblings) 17 siblings, 0 replies; 421+ messages in thread From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw) To: dev Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan, yongwang, Anatoly Burakov, Ray Kinsella

Add a simple on/off switch that will enable saving power when no packets are arriving. It is based on counting the number of empty polls and, when the number reaches a certain threshold, entering an architecture-defined optimized power state that waits until either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple queues per device are supported, but they have to be polled on different cores).

This design uses PMD RX callbacks.

The following are the available schemes:

1. UMWAIT/UMONITOR: When a certain threshold of empty polls is reached, the core will go into a power optimized sleep while waiting on an address of next RX descriptor to be written to.

2. Pause instruction: Instead of moving the core into a deeper C state, this method uses the pause instruction to avoid busy polling.

3. Frequency scaling: Reuse the existing DPDK power library to scale core frequency up/down depending on traffic volume.
Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: David Hunt <david.hunt@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- Notes: v11: - Updated power library document - Updated release notes v10: - Updated power library document v8: - Rename version map file name v7: - Fixed race condition (Konstantin) - Slight rework of the structure of monitor code - Added missing inline for wakeup v6: - Added wakeup mechanism for UMWAIT - Removed memory allocation (everything is now allocated statically) - Fixed various typos and comments - Check for invalid queue ID - Moved release notes to this patch v5: - Make error checking more robust - Prevent initializing scaling if ACPI or PSTATE env wasn't set - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr - Add some debug logging - Replace x86-specific code path to generic path using the intrinsic check --- doc/guides/prog_guide/power_man.rst | 51 ++++ doc/guides/rel_notes/release_20_11.rst | 15 +- lib/librte_power/meson.build | 5 +- lib/librte_power/rte_power_pmd_mgmt.c | 320 +++++++++++++++++++++++++ lib/librte_power/rte_power_pmd_mgmt.h | 92 +++++++ lib/librte_power/version.map | 4 + 6 files changed, 484 insertions(+), 3 deletions(-) create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index 0a3755a901..380e7aace7 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -192,6 +192,54 @@ User Cases ---------- The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA. +PMD Power Management API +------------------------ + +Abstract +~~~~~~~~ + +Existing Power Management mechanisms require developers to change the design of +an application or change code to make use of it. 
The PMD Power Management API +provides a convenient alternative by using Ethernet PMD RX callbacks and +triggering power saving whenever the empty poll count reaches a certain number. + +There are multiple power saving schemes available for the developer to choose from. +Although the developer can configure each queue with a different scheme, it is +strongly recommended to configure all queues within the same port with the same +scheme. + +The following are the available schemes: + + * UMWAIT/UMONITOR + + This power saving scheme will put the core into an optimized power state and + use the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX + descriptor address, and wake the CPU up whenever there's new traffic. + + * Pause + + This power saving scheme will use the "rte_pause" function to reduce the impact + of busy polling. + + * Frequency scaling + + This power saving scheme will use existing power library functionality to + scale the core frequency up/down depending on traffic volume. + + +.. note:: + + Currently, this Power Management API is limited to mapping of 1 queue to 1 + core (multiple queues are supported, but they must be polled from different + cores). + +API Overview for PMD Power Management +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +* **Queue Enable**: Enable a specific power scheme for a certain queue/port/core + +* **Queue Disable**: Disable the power scheme for a certain queue/port/core + References ---------- @@ -200,3 +248,6 @@ References * The :doc:`../sample_app_ug/vm_power_management` chapter in the :doc:`../sample_app_ug/index` section. + +* The :doc:`../sample_app_ug/rxtx_callbacks` + chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst index e95e6aa7a5..0430fca9cc 100644 --- a/doc/guides/rel_notes/release_20_11.rst +++ b/doc/guides/rel_notes/release_20_11.rst @@ -150,7 +150,9 @@ New Features * **ethdev: added 1 new API for PMD power management.** - * ``rte_eth_get_wake_addr()`` is added to get the wake up address from device. + * The ``rte_eth_get_wake_addr()`` function has been added to allow applications to + fetch the wake up information from the device. The processor needs that information + to wake up from the low power state. * **Updated Broadcom bnxt driver.** @@ -362,6 +364,17 @@ New Features * Replaced ``--scalar`` command-line option with ``--alg=<value>``, to allow the user to select the desired classify method. +* **Added PMD power management mechanism** + + The new Ethernet PMD Power Management mechanisms have been added through the + existing RX callback infrastructure. + + * Added power saving scheme based on UMWAIT instruction (x86 only) + * Added power saving scheme based on ``rte_pause()`` + * Added power saving scheme based on frequency scaling through the power library + * Added new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()`` + * Added new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()`` + Removed Items ------------- diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build index 78c031c943..cc3c7a8646 100644 --- a/lib/librte_power/meson.build +++ b/lib/librte_power/meson.build @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c', 'power_kvm_vm.c', 'guest_channel.c', 'rte_power_empty_poll.c', 'power_pstate_cpufreq.c', + 'rte_power_pmd_mgmt.c', 'power_common.c') -headers = files('rte_power.h','rte_power_empty_poll.h') -deps += ['timer'] +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h') +deps += ['timer', 'ethdev'] diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c new file mode
100644 index 0000000000..0dcaddc3bd --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.c @@ -0,0 +1,320 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <rte_lcore.h> +#include <rte_cycles.h> +#include <rte_cpuflags.h> +#include <rte_malloc.h> +#include <rte_ethdev.h> +#include <rte_power_intrinsics.h> + +#include "rte_power_pmd_mgmt.h" + +#define EMPTYPOLL_MAX 512 + +/** + * Possible power management states of an ethdev port. + */ +enum pmd_mgmt_state { + /** Device power management is disabled. */ + PMD_MGMT_DISABLED = 0, + /** Device power management is enabled. */ + PMD_MGMT_ENABLED, +}; + +struct pmd_queue_cfg { + enum pmd_mgmt_state pwr_mgmt_state; + /**< State of power management for this queue */ + enum rte_power_pmd_mgmt_type cb_mode; + /**< Callback mode for this queue */ + const struct rte_eth_rxtx_callback *cur_cb; + /**< Callback instance */ + rte_spinlock_t umwait_lock; + /**< Per-queue status lock - used only for UMWAIT mode */ + volatile void *wait_addr; + /**< UMWAIT wakeup address */ + uint64_t empty_poll_stats; + /**< Number of empty polls */ +} __rte_cache_aligned; + +static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT]; + +/* trigger a write to the cache line we're waiting on */ +static inline void +umwait_wakeup(volatile void *addr) +{ + uint64_t val; + + val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED); + __atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0, + __ATOMIC_RELAXED, __ATOMIC_RELAXED); +} + +static inline void +umwait_sleep(struct pmd_queue_cfg *q_conf, uint16_t port_id, uint16_t qidx) +{ + volatile void *target_addr; + uint64_t expected, mask; + uint8_t data_sz; + uint16_t ret; + + /* + * get wake up address for this RX queue, as well as expected value, + * comparison mask, and data size. 
+ */ + ret = rte_eth_get_wake_addr(port_id, qidx, &target_addr, + &expected, &mask, &data_sz); + + /* this should always succeed as all checks have been done already */ + if (unlikely(ret != 0)) + return; + + /* + * take out a spinlock to prevent control plane from concurrently + * modifying the wakeup data. + */ + rte_spinlock_lock(&q_conf->umwait_lock); + + /* have we been disabled by control plane? */ + if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { + /* we're good to go */ + + /* + * store the wakeup address so that control plane can trigger a + * write to this address and wake us up. + */ + q_conf->wait_addr = target_addr; + /* -1ULL is maximum value for TSC */ + rte_power_monitor_sync(target_addr, expected, mask, -1ULL, + data_sz, &q_conf->umwait_lock); + /* erase the address */ + q_conf->wait_addr = NULL; + } + rte_spinlock_unlock(&q_conf->umwait_lock); +} + +static uint16_t +clb_umwait(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *addr __rte_unused) +{ + + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) + umwait_sleep(q_conf, port_id, qidx); + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_pause(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *addr __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + /* sleep for 1 microsecond */ + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) + rte_delay_us(1); + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_scale_freq(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void 
*_ __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) + /* scale down freq */ + rte_power_freq_min(rte_lcore_id()); + } else { + q_conf->empty_poll_stats = 0; + /* scale up freq */ + rte_power_freq_max(rte_lcore_id()); + } + + return nb_rx; +} + +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id, + enum rte_power_pmd_mgmt_type mode) +{ + struct rte_eth_dev *dev; + struct pmd_queue_cfg *queue_cfg; + int ret; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); + dev = &rte_eth_devices[port_id]; + + /* check if queue id is valid */ + if (queue_id >= dev->data->nb_rx_queues || + queue_id >= RTE_MAX_QUEUES_PER_PORT) { + return -EINVAL; + } + + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) { + ret = -EINVAL; + goto end; + } + + switch (mode) { + case RTE_POWER_MGMT_TYPE_WAIT: + { + /* check if rte_power_monitor is supported */ + uint64_t dummy_expected, dummy_mask; + struct rte_cpu_intrinsics i; + volatile void *dummy_addr; + uint8_t dummy_sz; + + rte_cpu_get_intrinsics_support(&i); + + if (!i.power_monitor) { + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); + ret = -ENOTSUP; + goto end; + } + + /* check if the device supports the necessary PMD API */ + if (rte_eth_get_wake_addr(port_id, queue_id, + &dummy_addr, &dummy_expected, + &dummy_mask, &dummy_sz) == -ENOTSUP) { + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n"); + ret = -ENOTSUP; + goto end; + } + /* initialize UMWAIT spinlock */ + rte_spinlock_init(&queue_cfg->umwait_lock); + + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, +
clb_umwait, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_SCALE: + { + enum power_management_env env; + /* only PSTATE and ACPI modes are supported */ + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && + !rte_power_check_env_supported( + PM_ENV_PSTATE_CPUFREQ)) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); + ret = -ENOTSUP; + goto end; + } + /* ensure we could initialize the power library */ + if (rte_power_init(lcore_id)) { + ret = -EINVAL; + goto end; + } + /* ensure we initialized the correct env */ + env = rte_power_get_env(); + if (env != PM_ENV_ACPI_CPUFREQ && + env != PM_ENV_PSTATE_CPUFREQ) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); + ret = -ENOTSUP; + goto end; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, + queue_id, clb_scale_freq, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_PAUSE: + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_pause, NULL); + break; + } + ret = 0; + +end: + return ret; +} + +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id) +{ + struct pmd_queue_cfg *queue_cfg; + int ret; + + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) { + ret = -EINVAL; + goto end; + } + + switch (queue_cfg->cb_mode) { + case RTE_POWER_MGMT_TYPE_WAIT: + rte_spinlock_lock(&queue_cfg->umwait_lock); + + /* wake up the core from UMWAIT sleep, if any */ + if (queue_cfg->wait_addr != NULL) + umwait_wakeup(queue_cfg->wait_addr); + /* + * we need to disable early as there might be callback currently + * spinning on a lock. 
+ */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + + rte_spinlock_unlock(&queue_cfg->umwait_lock); + /* fall-through */ + case RTE_POWER_MGMT_TYPE_PAUSE: + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + break; + case RTE_POWER_MGMT_TYPE_SCALE: + rte_power_freq_max(lcore_id); + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + rte_power_exit(lcore_id); + break; + } + /* + * we don't free the RX callback here because it is unsafe to do so + * unless we know for a fact that all data plane threads have stopped. + */ + queue_cfg->cur_cb = NULL; + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + ret = 0; +end: + return ret; +} diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h new file mode 100644 index 0000000000..a7a3f98268 --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.h @@ -0,0 +1,92 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#ifndef _RTE_POWER_PMD_MGMT_H +#define _RTE_POWER_PMD_MGMT_H + +/** + * @file + * RTE PMD Power Management + */ +#include <stdint.h> +#include <stdbool.h> + +#include <rte_common.h> +#include <rte_byteorder.h> +#include <rte_log.h> +#include <rte_power.h> +#include <rte_atomic.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * PMD Power Management Type + */ +enum rte_power_pmd_mgmt_type { + /** WAIT callback mode. */ + RTE_POWER_MGMT_TYPE_WAIT = 1, + /** PAUSE callback mode. */ + RTE_POWER_MGMT_TYPE_PAUSE, + /** Freq Scaling callback mode. */ + RTE_POWER_MGMT_TYPE_SCALE, +}; + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Setup per-queue power management callback. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The queue identifier of the Ethernet device. 
+ * @param mode + * The power management callback function type. + + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, + uint16_t port_id, + uint16_t queue_id, + enum rte_power_pmd_mgmt_type mode); + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Remove per-queue power management callback. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The queue identifier of the Ethernet device. + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, + uint16_t queue_id); +#ifdef __cplusplus +} +#endif + +#endif diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map index 69ca9af616..3f2f6cd6f6 100644 --- a/lib/librte_power/version.map +++ b/lib/librte_power/version.map @@ -34,4 +34,8 @@ EXPERIMENTAL { rte_power_guest_channel_receive_msg; rte_power_poll_stat_fetch; rte_power_poll_stat_update; + # added in 20.11 + rte_power_pmd_mgmt_queue_enable; + rte_power_pmd_mgmt_queue_disable; + }; -- 2.17.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
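The three RX callbacks in this patch (`clb_umwait`, `clb_pause`, `clb_scale_freq`) share one shape: count consecutive empty polls, act once the count exceeds `EMPTYPOLL_MAX`, and reset on the first non-empty burst. A minimal sketch of that shape without any DPDK dependency might look as follows. `struct poll_state`, `power_save_hook()` and `on_rx_burst()` are hypothetical names introduced only for illustration; the hook merely counts invocations where the patch would sleep or scale frequency.

```c
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold value as in the patch */

/* hypothetical per-queue state; the patch keeps this in struct pmd_queue_cfg */
struct poll_state {
	uint64_t empty_polls; /* consecutive empty polls seen so far */
	uint64_t sleep_calls; /* how many times the power-saving hook ran */
};

/* stand-in for umwait_sleep(), rte_delay_us(1) or rte_power_freq_min() */
static void
power_save_hook(struct poll_state *st)
{
	st->sleep_calls++;
}

/* same control flow as the callbacks: invoked after every RX burst */
static uint16_t
on_rx_burst(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx == 0) {
		st->empty_polls++;
		/* strictly greater than: the 513th empty poll triggers first */
		if (st->empty_polls > EMPTYPOLL_MAX)
			power_save_hook(st);
	} else {
		/* traffic is back: start counting from scratch */
		st->empty_polls = 0;
	}
	return nb_rx;
}
```

Note that, as in the patch, the counter is not reset after the hook runs, so once the threshold is crossed every further empty poll invokes the hook again until a packet arrives.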
* [dpdk-dev] [PATCH v11 3/6] net/ixgbe: implement power management API 2020-11-02 11:09 ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 1/6] ethdev: add simple power management API Liang Ma 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 2/6] power: add PMD power management API and callback Liang Ma @ 2020-11-02 11:10 ` Liang Ma 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 4/6] net/i40e: " Liang Ma ` (14 subsequent siblings) 17 siblings, 0 replies; 421+ messages in thread From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw) To: dev Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan, yongwang, Anatoly Burakov, Jeff Guo Implement support for the power management API by implementing a `get_wake_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Liang Ma <liang.j.ma@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ixgbe/ixgbe_ethdev.c | 1 + drivers/net/ixgbe/ixgbe_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/ixgbe/ixgbe_rxtx.h | 2 ++ 3 files changed, 28 insertions(+) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 00101c2eec..fcc4026372 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -588,6 +588,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = { .udp_tunnel_port_del = ixgbe_dev_udp_tunnel_port_del, .tm_ops_get = ixgbe_tm_ops_get, .tx_done_cleanup = ixgbe_dev_tx_done_cleanup, + .get_wake_addr = ixgbe_get_wake_addr, }; /* diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index 6cfbb582e2..db94b9d05d 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -1369,6 +1369,31 @@ const uint32_t RTE_PTYPE_INNER_L3_IPV4_EXT | 
RTE_PTYPE_INNER_L4_UDP, }; +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t *data_sz) +{ + volatile union ixgbe_adv_rx_desc *rxdp; + struct ixgbe_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + *tail_desc_addr = &rxdp->wb.upper.status_error; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + *expected = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + *mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + + /* the registers are 32-bit */ + *data_sz = 4; + + return 0; +} + /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */ static inline uint32_t ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask) diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h index 6d2f7c9da3..1ef0b05e66 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.h +++ b/drivers/net/ixgbe/ixgbe_rxtx.h @@ -299,5 +299,7 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev); +int ixgbe_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t *data_sz); #endif /* _IXGBE_RXTX_H_ */ -- 2.17.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
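For drivers not covered by this series, the driver-side contract of `get_wake_addr` can be sketched against a made-up descriptor layout. Everything prefixed `fake_`/`FAKE_` below is hypothetical and exists only for illustration; real descriptor formats are device-specific. The sketch also assumes a little-endian host, which is why it omits the `rte_cpu_to_le_*()` conversions the real drivers apply.

```c
#include <stdint.h>

#define FAKE_STAT_DD 0x01u /* hypothetical "descriptor done" bit */

/* hypothetical write-back descriptor; real layouts are device-specific */
struct fake_rx_desc {
	uint64_t pkt_addr;
	uint32_t length;
	uint32_t status_error; /* device sets FAKE_STAT_DD here when done */
};

struct fake_rx_queue {
	volatile struct fake_rx_desc *rx_ring;
	uint16_t rx_tail; /* index of the next descriptor the device fills */
};

/* same signature as the eth_get_wake_addr_t callback in the series */
static int
fake_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr,
		uint64_t *expected, uint64_t *mask, uint8_t *data_sz)
{
	struct fake_rx_queue *rxq = rx_queue;
	volatile struct fake_rx_desc *rxdp = &rxq->rx_ring[rxq->rx_tail];

	/* monitor the status word of the next descriptor to be written */
	*tail_desc_addr = &rxdp->status_error;

	/* wake up once the DD bit reads as 1; other bits are ignored */
	*expected = FAKE_STAT_DD;
	*mask = FAKE_STAT_DD;

	/* width of the monitored field in bytes (4 for this layout) */
	*data_sz = sizeof(rxdp->status_error);

	return 0;
}
```

The design point shared with the ixgbe, i40e and ice callbacks is that `expected` and `mask` both isolate a single "done" bit, and `data_sz` matches the width of the field being watched.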
* [dpdk-dev] [PATCH v11 4/6] net/i40e: implement power management API 2020-11-02 11:09 ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma ` (2 preceding siblings ...) 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 3/6] net/ixgbe: implement power management API Liang Ma @ 2020-11-02 11:10 ` Liang Ma 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 5/6] net/ice: " Liang Ma ` (13 subsequent siblings) 17 siblings, 0 replies; 421+ messages in thread From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw) To: dev Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan, yongwang, Anatoly Burakov, Beilei Xing, Jeff Guo Implement support for the power management API by implementing a `get_wake_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Acked-by: Jeff Guo <jia.guo@intel.com> --- drivers/net/i40e/i40e_ethdev.c | 1 + drivers/net/i40e/i40e_rxtx.c | 26 ++++++++++++++++++++++++++ drivers/net/i40e/i40e_rxtx.h | 2 ++ 3 files changed, 29 insertions(+) diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c index 4778aaf299..358a38232b 100644 --- a/drivers/net/i40e/i40e_ethdev.c +++ b/drivers/net/i40e/i40e_ethdev.c @@ -513,6 +513,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = { .mtu_set = i40e_dev_mtu_set, .tm_ops_get = i40e_tm_ops_get, .tx_done_cleanup = i40e_tx_done_cleanup, + .get_wake_addr = i40e_get_wake_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index 5df9a9df56..78862fe3a2 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -72,6 +72,32 @@ #define I40E_TX_OFFLOAD_NOTSUP_MASK \ (PKT_TX_OFFLOAD_MASK ^ 
I40E_TX_OFFLOAD_MASK) +int +i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t *data_sz) +{ + struct i40e_rx_queue *rxq = rx_queue; + volatile union i40e_rx_desc *rxdp; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + *tail_desc_addr = &rxdp->wb.qword1.status_error_len; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + *expected = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + *mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + + /* registers are 64-bit */ + *data_sz = 8; + + return 0; +} + static inline void i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp) { diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..5826cf1099 100644 --- a/drivers/net/i40e/i40e_rxtx.h +++ b/drivers/net/i40e/i40e_rxtx.h @@ -248,6 +248,8 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); +int i40e_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t *data_sz); /* For each value it means, datasheet of hardware can tell more details * -- 2.17.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v11 5/6] net/ice: implement power management API 2020-11-02 11:09 ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma ` (3 preceding siblings ...) 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 4/6] net/i40e: " Liang Ma @ 2020-11-02 11:10 ` Liang Ma 2020-11-02 11:10 ` [dpdk-dev] [PATCH v11 6/6] examples/l3fwd-power: enable PMD power mgmt Liang Ma ` (12 subsequent siblings) 17 siblings, 0 replies; 421+ messages in thread From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw) To: dev Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev, david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel, gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan, yongwang, Anatoly Burakov, Qiming Yang, Qi Zhang Implement support for the power management API by implementing a `get_wake_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ice/ice_ethdev.c | 1 + drivers/net/ice/ice_rxtx.c | 26 ++++++++++++++++++++++++++ drivers/net/ice/ice_rxtx.h | 2 ++ 3 files changed, 29 insertions(+) diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c index c65125ff32..54f185ad4d 100644 --- a/drivers/net/ice/ice_ethdev.c +++ b/drivers/net/ice/ice_ethdev.c @@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = { .udp_tunnel_port_add = ice_dev_udp_tunnel_port_add, .udp_tunnel_port_del = ice_dev_udp_tunnel_port_del, .tx_done_cleanup = ice_tx_done_cleanup, + .get_wake_addr = ice_get_wake_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index ee576c362a..fafd6ada62 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask; uint64_t 
rte_net_ice_dynflag_proto_xtr_tcp_mask; uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask; +int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t *data_sz) +{ + volatile union ice_rx_flex_desc *rxdp; + struct ice_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + *tail_desc_addr = &rxdp->wb.status_error0; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + *expected = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + *mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + + /* register is 16-bit */ + *data_sz = 2; + + return 0; +} + + static inline uint8_t ice_proto_xtr_type_to_rxdid(uint8_t xtr_type) { diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h index 1c23c7541e..7eeb8d467e 100644 --- a/drivers/net/ice/ice_rxtx.h +++ b/drivers/net/ice/ice_rxtx.h @@ -250,6 +250,8 @@ uint16_t ice_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc); int ice_tx_done_cleanup(void *txq, uint32_t free_cnt); +int ice_get_wake_addr(void *rx_queue, volatile void **tail_desc_addr, + uint64_t *expected, uint64_t *mask, uint8_t *data_sz); #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \ int i; \ -- 2.17.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v11 6/6] examples/l3fwd-power: enable PMD power mgmt
  2020-11-02 11:09 ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                     ` (4 preceding siblings ...)
  2020-11-02 11:10   ` [dpdk-dev] [PATCH v11 5/6] net/ice: " Liang Ma
@ 2020-11-02 11:10   ` Liang Ma
  2020-12-17 14:05   ` [dpdk-dev] [PATCH v12 00/11] Add PMD power management Anatoly Burakov
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 421+ messages in thread
From: Liang Ma @ 2020-11-02 11:10 UTC (permalink / raw)
  To: dev
  Cc: ruifeng.wang, haiyue.wang, bruce.richardson, konstantin.ananyev,
	david.hunt, jerinjacobk, nhorman, thomas, timothy.mcdaniel,
	gage.eads, mw, gtzalik, ajit.khaparde, hkalra, johndale, matan,
	yongwang, Anatoly Burakov

Add PMD power management feature support to l3fwd-power sample app.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---

Notes:
    v11:
    - Update l3fwd-power documentation

    v8:
    - Add return status check for queue enable

    v6:
    - Fixed typos in documentation
---
 .../sample_app_ug/l3_forward_power_man.rst | 14 ++++++
 examples/l3fwd-power/main.c | 46 ++++++++++++++++++-
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index d7e1dc5813..149342d112 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -109,6 +109,8 @@ where,
 * --telemetry: Telemetry mode.
 
+* --pmd-mgmt: PMD power management mode.
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -455,3 +457,15 @@ reference cycles and accordingly busy rate is set to either 0% or
 The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the
 script ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option
 ``Send for global Metrics``.
+
+PMD power management Mode
+-------------------------
+
+The PMD Power Management mode support for ``l3fwd-power`` is a standalone mode, in this mode
+``l3fwd-power`` does simple l3fwding along with enabling the power saving scheme on a specific
+port/queue/lcore. The main purpose for this mode is to demonstrate how to use the PMD power
+management API.
+
+.. code-block:: console
+
+  ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index a48d75f68f..aafa415f0b 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_power_empty_poll.h>
 #include <rte_metrics.h>
 #include <rte_telemetry.h>
+#include <rte_power_pmd_mgmt.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -199,7 +200,8 @@ enum appmode {
 	APP_MODE_LEGACY,
 	APP_MODE_EMPTY_POLL,
 	APP_MODE_TELEMETRY,
-	APP_MODE_INTERRUPT
+	APP_MODE_INTERRUPT,
+	APP_MODE_PMD_MGMT
 };
 
 enum appmode app_mode;
@@ -1750,6 +1752,7 @@ parse_ep_config(const char *q_arg)
 #define CMD_LINE_OPT_EMPTY_POLL "empty-poll"
 #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only"
 #define CMD_LINE_OPT_TELEMETRY "telemetry"
+#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt"
 
 /* Parse the argument given in the command line of the application */
 static int
@@ -1771,6 +1774,7 @@ parse_args(int argc, char **argv)
 		{CMD_LINE_OPT_LEGACY, 0, 0, 0},
 		{CMD_LINE_OPT_TELEMETRY, 0, 0, 0},
 		{CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0},
+		{CMD_LINE_OPT_PMD_MGMT, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -1881,6 +1885,16 @@ parse_args(int argc, char **argv)
 				printf("telemetry mode is enabled\n");
 			}
 
+			if (!strncmp(lgopts[option_index].name,
+					CMD_LINE_OPT_PMD_MGMT,
+					sizeof(CMD_LINE_OPT_PMD_MGMT))) {
+				if (app_mode != APP_MODE_DEFAULT) {
+					printf(" power mgmt mode is mutually exclusive with other modes\n");
+					return -1;
+				}
+				app_mode = APP_MODE_PMD_MGMT;
+				printf("PMD power mgmt mode is enabled\n");
+			}
 			if (!strncmp(lgopts[option_index].name,
 					CMD_LINE_OPT_INTERRUPT_ONLY,
 					sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) {
@@ -2437,6 +2451,8 @@ mode_to_str(enum appmode mode)
 		return "telemetry";
 	case APP_MODE_INTERRUPT:
 		return "interrupt-only";
+	case APP_MODE_PMD_MGMT:
+		return "pmd mgmt";
 	default:
 		return "invalid";
 	}
@@ -2705,6 +2721,17 @@ main(int argc, char **argv)
 			} else if (!check_ptype(portid))
 				rte_exit(EXIT_FAILURE,
 					"PMD can not provide needed ptypes\n");
+			if (app_mode == APP_MODE_PMD_MGMT) {
+				ret = rte_power_pmd_mgmt_queue_enable(lcore_id,
+						portid, queueid,
+						RTE_POWER_MGMT_TYPE_SCALE);
+
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_power_pmd_mgmt enable: err=%d, "
+						"port=%d\n", ret, portid);
+
+			}
 		}
 	}
 
@@ -2790,6 +2817,9 @@ main(int argc, char **argv)
 						SKIP_MAIN);
 	} else if (app_mode == APP_MODE_INTERRUPT) {
 		rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN);
+	} else if (app_mode == APP_MODE_PMD_MGMT) {
+		rte_eal_mp_remote_launch(main_telemetry_loop, NULL,
+				CALL_MAIN);
 	}
 
 	if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
@@ -2816,6 +2846,20 @@ main(int argc, char **argv)
 	if (app_mode == APP_MODE_EMPTY_POLL)
 		rte_power_empty_poll_stat_free();
 
+	if (app_mode == APP_MODE_PMD_MGMT) {
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			if (rte_lcore_is_enabled(lcore_id) == 0)
+				continue;
+			qconf = &lcore_conf[lcore_id];
+			for (queue = 0; queue < qconf->n_rx_queue; ++queue) {
+				portid = qconf->rx_queue_list[queue].port_id;
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				rte_power_pmd_mgmt_queue_disable(lcore_id,
+						portid, queueid);
+			}
+		}
+	}
+
 	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL)
 			&& deinit_power_library())
 		rte_exit(EXIT_FAILURE, "deinit_power_library failed\n");
-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v12 00/11] Add PMD power management
  2020-11-02 11:09 ` [dpdk-dev] [PATCH v11 0/6] Add PMD power mgmt Liang Ma
                     ` (5 preceding siblings ...)
  2020-11-02 11:10   ` [dpdk-dev] [PATCH v11 6/6] examples/l3fwd-power: enable PMD power mgmt Liang Ma
@ 2020-12-17 14:05   ` Anatoly Burakov
  2020-12-17 16:12     ` David Marchand
  2021-01-08 17:42     ` [dpdk-dev] [PATCH v13 " Anatoly Burakov
  2020-12-17 14:05   ` [dpdk-dev] [PATCH v12 01/11] eal: uninline power intrinsics Anatoly Burakov
                     ` (10 subsequent siblings)
  17 siblings, 2 replies; 421+ messages in thread
From: Anatoly Burakov @ 2020-12-17 14:05 UTC (permalink / raw)
  To: dev
  Cc: thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. This is achieved through cooperation with the NIC driver, which
tells us the address of the wake-up event so that we can wait for
writes to it.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
are used in their raw opcode form because there is no widespread
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're
counting empty polls, and if a certain threshold is reached, we get the
address of the next RX ring descriptor from the NIC driver, arm the
monitoring hardware, and enter a power-optimized state. We will then
wake up when either a timeout happens, or a write happens (or generally
whenever the CPU feels like waking up - this is platform-specific), and
proceed as normal. The empty poll counter is reset whenever we actually
get packets, so we only go to sleep when we know nothing is going on.
The mechanism is generic and can be used for any write-back descriptor.
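
The empty-poll mechanism described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patchset's implementation: struct poll_state, poll_should_sleep() and the threshold field are hypothetical names, and the actual arming of the monitoring hardware and the sleep itself are left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue bookkeeping; names are not from the patchset. */
struct poll_state {
	uint32_t threshold;   /* empty polls required before sleeping */
	uint32_t empty_polls; /* consecutive polls that returned nothing */
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns true when the caller should arm the monitor and enter the
 * power-optimized state.
 */
static bool
poll_should_sleep(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx > 0) {
		/* traffic seen: reset the counter and keep polling */
		st->empty_polls = 0;
		return false;
	}

	/* saturate at the threshold so the counter cannot overflow */
	if (st->empty_polls < st->threshold)
		st->empty_polls++;

	return st->empty_polls >= st->threshold;
}
```

Because the counter is reset on any non-empty burst, a single received packet is enough to keep the core polling for another full threshold's worth of iterations.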

This patchset also introduces a few changes into existing power
management-related intrinsics, namely to provide a native way of waking
up a sleeping core without the application being responsible for it, as
well as general robustness improvements. There's quite a bit of locking
going on, but these locks are per-thread and very little (if any)
contention is expected, so the performance impact shouldn't be that bad
(and in any case the locking happens when we're about to sleep anyway,
not on a hotpath).

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows us
to just flip a switch and automatically have power savings.

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  may handle RX on at most one queue
- Three policy types are supported: Monitor/Pause/Frequency Scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst | 44 +++
 doc/guides/rel_notes/release_21_02.rst | 14 +
 .../sample_app_ug/l3_forward_power_man.rst | 35 ++
 drivers/event/dlb/dlb.c | 10 +-
 drivers/event/dlb2/dlb2.c | 10 +-
 drivers/net/i40e/i40e_ethdev.c | 1 +
 drivers/net/i40e/i40e_rxtx.c | 25 ++
 drivers/net/i40e/i40e_rxtx.h | 1 +
 drivers/net/ice/ice_ethdev.c | 1 +
 drivers/net/ice/ice_rxtx.c | 26 ++
 drivers/net/ice/ice_rxtx.h | 1 +
 drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
 drivers/net/ixgbe/ixgbe_rxtx.c | 25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h | 1 +
 examples/l3fwd-power/main.c | 89 ++++-
 .../arm/include/rte_power_intrinsics.h | 39 +-
 .../include/generic/rte_power_intrinsics.h | 78 ++--
 .../ppc/include/rte_power_intrinsics.h | 39 +-
 lib/librte_eal/version.map | 5 +
 .../x86/include/rte_power_intrinsics.h | 115 ------
 lib/librte_eal/x86/meson.build | 1 +
 lib/librte_eal/x86/rte_power_intrinsics.c | 169 +++++++++
 lib/librte_ethdev/rte_ethdev.c | 28 ++
 lib/librte_ethdev/rte_ethdev.h | 25 ++
 lib/librte_ethdev/rte_ethdev_driver.h | 22 ++
 lib/librte_ethdev/version.map | 3 +
 lib/librte_power/meson.build | 5 +-
 lib/librte_power/rte_power_pmd_mgmt.c | 349 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++
 lib/librte_power/version.map | 5 +
 30 files changed, 1028 insertions(+), 229 deletions(-)
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.17.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management 2020-12-17 14:05 ` [dpdk-dev] [PATCH v12 00/11] Add PMD power management Anatoly Burakov @ 2020-12-17 16:12 ` David Marchand 2021-01-08 16:42 ` Burakov, Anatoly 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov 1 sibling, 1 reply; 421+ messages in thread From: David Marchand @ 2020-12-17 16:12 UTC (permalink / raw) To: Anatoly Burakov Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads, Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara, Ray Kinsella, Yigit, Ferruh On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov <anatoly.burakov@intel.com> wrote: > > This patchset proposes a simple API for Ethernet drivers to cause the > CPU to enter a power-optimized state while waiting for packets to > arrive. This is achieved through cooperation with the NIC driver that > will allow us to know address of wake up event, and wait for writes on > it. > > On IA, this is achieved through using UMONITOR/UMWAIT instructions. They > are used in their raw opcode form because there is no widespread > compiler support for them yet. Still, the API is made generic enough to > hopefully support other architectures, if they happen to implement > similar instructions. > > To achieve power savings, there is a very simple mechanism used: we're > counting empty polls, and if a certain threshold is reached, we get the > address of next RX ring descriptor from the NIC driver, arm the > monitoring hardware, and enter a power-optimized state. We will then > wake up when either a timeout happens, or a write happens (or generally > whenever CPU feels like waking up - this is platform-specific), and > proceed as normal. The empty poll counter is reset whenever we actually > get packets, so we only go to sleep when we know nothing is going on. > The mechanism is generic which can be used for any write back > descriptor. 
> > This patchset also introduces a few changes into existing power > management-related intrinsics, namely to provide a native way of waking > up a sleeping core without application being responsible for it, as well > as general robustness improvements. There's quite a bit of locking going > on, but these locks are per-thread and very little (if any) contention > is expected, so the performance impact shouldn't be that bad (and in any > case the locking happens when we're about to sleep anyway, not on a > hotpath). > > Why are we putting it into ethdev as opposed to leaving this up to the > application? Our customers specifically requested a way to do it wit > minimal changes to the application code. The current approach allows to > just flip a switch and automatically have power savings. > > - Only 1:1 core to queue mapping is supported, meaning that each lcore > must at most handle RX on a single queue > - Support 3 type policies. Monitor/Pause/Frequency Scaling > - Power management is enabled per-queue > - The API doesn't extend to other device types Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this: https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082 We will have to put an exception on driver only ABI. -- David Marchand ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management 2020-12-17 16:12 ` David Marchand @ 2021-01-08 16:42 ` Burakov, Anatoly 2021-01-11 8:44 ` David Marchand 0 siblings, 1 reply; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-08 16:42 UTC (permalink / raw) To: David Marchand Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads, Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara, Ray Kinsella, Yigit, Ferruh On 17-Dec-20 4:12 PM, David Marchand wrote: > On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov > <anatoly.burakov@intel.com> wrote: >> >> This patchset proposes a simple API for Ethernet drivers to cause the >> CPU to enter a power-optimized state while waiting for packets to >> arrive. This is achieved through cooperation with the NIC driver that >> will allow us to know address of wake up event, and wait for writes on >> it. >> >> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They >> are used in their raw opcode form because there is no widespread >> compiler support for them yet. Still, the API is made generic enough to >> hopefully support other architectures, if they happen to implement >> similar instructions. >> >> To achieve power savings, there is a very simple mechanism used: we're >> counting empty polls, and if a certain threshold is reached, we get the >> address of next RX ring descriptor from the NIC driver, arm the >> monitoring hardware, and enter a power-optimized state. We will then >> wake up when either a timeout happens, or a write happens (or generally >> whenever CPU feels like waking up - this is platform-specific), and >> proceed as normal. The empty poll counter is reset whenever we actually >> get packets, so we only go to sleep when we know nothing is going on. >> The mechanism is generic which can be used for any write back >> descriptor. 
>> >> This patchset also introduces a few changes into existing power >> management-related intrinsics, namely to provide a native way of waking >> up a sleeping core without application being responsible for it, as well >> as general robustness improvements. There's quite a bit of locking going >> on, but these locks are per-thread and very little (if any) contention >> is expected, so the performance impact shouldn't be that bad (and in any >> case the locking happens when we're about to sleep anyway, not on a >> hotpath). >> >> Why are we putting it into ethdev as opposed to leaving this up to the >> application? Our customers specifically requested a way to do it wit >> minimal changes to the application code. The current approach allows to >> just flip a switch and automatically have power savings. >> >> - Only 1:1 core to queue mapping is supported, meaning that each lcore >> must at most handle RX on a single queue >> - Support 3 type policies. Monitor/Pause/Frequency Scaling >> - Power management is enabled per-queue >> - The API doesn't extend to other device types > > Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this: > https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082 > > We will have to put an exception on driver only ABI. > > Why does aarch64 build fail there? The functions in question are in the version map file, but the build complains that they aren't. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management 2021-01-08 16:42 ` Burakov, Anatoly @ 2021-01-11 8:44 ` David Marchand 2021-01-11 8:52 ` David Marchand 0 siblings, 1 reply; 421+ messages in thread From: David Marchand @ 2021-01-11 8:44 UTC (permalink / raw) To: Burakov, Anatoly Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads, Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara, Ray Kinsella, Yigit, Ferruh On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly <anatoly.burakov@intel.com> wrote: > > On 17-Dec-20 4:12 PM, David Marchand wrote: > > On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov > > <anatoly.burakov@intel.com> wrote: > >> > >> This patchset proposes a simple API for Ethernet drivers to cause the > >> CPU to enter a power-optimized state while waiting for packets to > >> arrive. This is achieved through cooperation with the NIC driver that > >> will allow us to know address of wake up event, and wait for writes on > >> it. > >> > >> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They > >> are used in their raw opcode form because there is no widespread > >> compiler support for them yet. Still, the API is made generic enough to > >> hopefully support other architectures, if they happen to implement > >> similar instructions. > >> > >> To achieve power savings, there is a very simple mechanism used: we're > >> counting empty polls, and if a certain threshold is reached, we get the > >> address of next RX ring descriptor from the NIC driver, arm the > >> monitoring hardware, and enter a power-optimized state. We will then > >> wake up when either a timeout happens, or a write happens (or generally > >> whenever CPU feels like waking up - this is platform-specific), and > >> proceed as normal. The empty poll counter is reset whenever we actually > >> get packets, so we only go to sleep when we know nothing is going on. > >> The mechanism is generic which can be used for any write back > >> descriptor. 
> >> > >> This patchset also introduces a few changes into existing power > >> management-related intrinsics, namely to provide a native way of waking > >> up a sleeping core without application being responsible for it, as well > >> as general robustness improvements. There's quite a bit of locking going > >> on, but these locks are per-thread and very little (if any) contention > >> is expected, so the performance impact shouldn't be that bad (and in any > >> case the locking happens when we're about to sleep anyway, not on a > >> hotpath). > >> > >> Why are we putting it into ethdev as opposed to leaving this up to the > >> application? Our customers specifically requested a way to do it wit > >> minimal changes to the application code. The current approach allows to > >> just flip a switch and automatically have power savings. > >> > >> - Only 1:1 core to queue mapping is supported, meaning that each lcore > >> must at most handle RX on a single queue > >> - Support 3 type policies. Monitor/Pause/Frequency Scaling > >> - Power management is enabled per-queue > >> - The API doesn't extend to other device types > > > > Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this: > > https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082 > > > > We will have to put an exception on driver only ABI. > > > > > > Why does aarch64 build fail there? The functions in question are in the > version map file, but the build complains that they aren't. From what I can see, this series puts rte_power_* symbols in a .h. So it will be seen as symbols exported by any library including such a header. The check then complains about this as it sees exported symbols unknown of the library version.map. -- David Marchand ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management 2021-01-11 8:44 ` David Marchand @ 2021-01-11 8:52 ` David Marchand 2021-01-11 10:21 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: David Marchand @ 2021-01-11 8:52 UTC (permalink / raw) To: Burakov, Anatoly Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads, Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara, Ray Kinsella, Yigit, Ferruh On Mon, Jan 11, 2021 at 9:44 AM David Marchand <david.marchand@redhat.com> wrote: > > On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly > <anatoly.burakov@intel.com> wrote: > > Why does aarch64 build fail there? The functions in question are in the > > version map file, but the build complains that they aren't. > > From what I can see, this series puts rte_power_* symbols in a .h. > So it will be seen as symbols exported by any library including such a header. > > The check then complains about this as it sees exported symbols > unknown of the library version.map. Quick fix: diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index 39e49cc45b..9e498e9ebf 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -13,35 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on ARM. - */ -void -rte_power_monitor(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); -} - -/** - * This function is not supported on ARM. - */ -void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - -/** - * This function is not supported on ARM. 
- */ -void -rte_power_monitor_wakeup(const unsigned int lcore_id) -{ - RTE_SET_USED(lcore_id); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build index d62875ebae..6ec53ea03a 100644 --- a/lib/librte_eal/arm/meson.build +++ b/lib/librte_eal/arm/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c new file mode 100644 index 0000000000..998f9898ad --- /dev/null +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -0,0 +1,31 @@ +#include <rte_common.h> +#include <rte_power_intrinsics.h> + +/** + * This function is not supported on ARM. + */ +void +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) +{ + RTE_SET_USED(pmc); + RTE_SET_USED(tsc_timestamp); +} + +/** + * This function is not supported on ARM. + */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} + +/** + * This function is not supported on ARM. + */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} HTH. -- David Marchand ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v12 00/11] Add PMD power management 2021-01-11 8:52 ` David Marchand @ 2021-01-11 10:21 ` Burakov, Anatoly 0 siblings, 0 replies; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-11 10:21 UTC (permalink / raw) To: David Marchand Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Gage Eads, Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara, Ray Kinsella, Yigit, Ferruh On 11-Jan-21 8:52 AM, David Marchand wrote: > On Mon, Jan 11, 2021 at 9:44 AM David Marchand > <david.marchand@redhat.com> wrote: >> >> On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly >> <anatoly.burakov@intel.com> wrote: >>> Why does aarch64 build fail there? The functions in question are in the >>> version map file, but the build complains that they aren't. >> >> From what I can see, this series puts rte_power_* symbols in a .h. >> So it will be seen as symbols exported by any library including such a header. >> >> The check then complains about this as it sees exported symbols >> unknown of the library version.map. > > Quick fix: > > diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h > b/lib/librte_eal/arm/include/rte_power_intrinsics.h > index 39e49cc45b..9e498e9ebf 100644 > --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h > +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h > @@ -13,35 +13,6 @@ extern "C" { > > #include "generic/rte_power_intrinsics.h" > > -/** > - * This function is not supported on ARM. > - */ > -void > -rte_power_monitor(const struct rte_power_monitor_cond *pmc, > - const uint64_t tsc_timestamp) > -{ > - RTE_SET_USED(pmc); > - RTE_SET_USED(tsc_timestamp); > -} > - > -/** > - * This function is not supported on ARM. > - */ > -void > -rte_power_pause(const uint64_t tsc_timestamp) > -{ > - RTE_SET_USED(tsc_timestamp); > -} > - > -/** > - * This function is not supported on ARM. 
> - */ > -void > -rte_power_monitor_wakeup(const unsigned int lcore_id) > -{ > - RTE_SET_USED(lcore_id); > -} > - > #ifdef __cplusplus > } > #endif > diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build > index d62875ebae..6ec53ea03a 100644 > --- a/lib/librte_eal/arm/meson.build > +++ b/lib/librte_eal/arm/meson.build > @@ -7,4 +7,5 @@ sources += files( > 'rte_cpuflags.c', > 'rte_cycles.c', > 'rte_hypervisor.c', > + 'rte_power_intrinsics.c', > ) > diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c > b/lib/librte_eal/arm/rte_power_intrinsics.c > new file mode 100644 > index 0000000000..998f9898ad > --- /dev/null > +++ b/lib/librte_eal/arm/rte_power_intrinsics.c > @@ -0,0 +1,31 @@ > +#include <rte_common.h> > +#include <rte_power_intrinsics.h> > + > +/** > + * This function is not supported on ARM. > + */ > +void > +rte_power_monitor(const struct rte_power_monitor_cond *pmc, > + const uint64_t tsc_timestamp) > +{ > + RTE_SET_USED(pmc); > + RTE_SET_USED(tsc_timestamp); > +} > + > +/** > + * This function is not supported on ARM. > + */ > +void > +rte_power_pause(const uint64_t tsc_timestamp) > +{ > + RTE_SET_USED(tsc_timestamp); > +} > + > +/** > + * This function is not supported on ARM. > + */ > +void > +rte_power_monitor_wakeup(const unsigned int lcore_id) > +{ > + RTE_SET_USED(lcore_id); > +} > > > HTH. > > OK, will add into v14 so. Thanks! -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v13 00/11] Add PMD power management
  2020-12-17 14:05 ` [dpdk-dev] [PATCH v12 00/11] Add PMD power management Anatoly Burakov
  2020-12-17 16:12   ` David Marchand
@ 2021-01-08 17:42   ` Anatoly Burakov
  2021-01-08 17:42     ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov
                       ` (11 more replies)
  1 sibling, 12 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw)
  To: dev
  Cc: thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel,
	david.hunt, bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver, which tells us the address of the wake-up event so
that we can wait for writes to that address.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
are used in their raw opcode form because there is no widespread
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from
within an Rx callback inside the PMD. Once there's traffic again, the
empty poll counter is reset.

This patchset also introduces a few changes into existing power
management-related intrinsics, namely to provide a native way of waking
up a sleeping core without the application being responsible for it, as
well as general robustness improvements.
There's quite a bit of locking going on, but these locks are per-thread
and very little (if any) contention is expected, so the performance
impact shouldn't be that bad (and in any case the locking happens when
we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows us
to just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  may handle RX on at most one queue
- Three policy types are supported: Monitor/Pause/Frequency Scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v13:
- Reworked the librte_power code to require less locking and handle
  invalid parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst | 44 +++
 doc/guides/rel_notes/release_21_02.rst | 15 +
 .../sample_app_ug/l3_forward_power_man.rst | 35 ++
 drivers/event/dlb/dlb.c | 10 +-
 drivers/event/dlb2/dlb2.c | 10 +-
 drivers/net/i40e/i40e_ethdev.c | 1 +
 drivers/net/i40e/i40e_rxtx.c | 25 ++
 drivers/net/i40e/i40e_rxtx.h | 1 +
 drivers/net/ice/ice_ethdev.c | 1 +
 drivers/net/ice/ice_rxtx.c | 26 ++
 drivers/net/ice/ice_rxtx.h | 1 +
 drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
 drivers/net/ixgbe/ixgbe_rxtx.c | 25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h | 1 +
 examples/l3fwd-power/main.c | 89 ++++-
 .../arm/include/rte_power_intrinsics.h | 39 +-
 .../include/generic/rte_power_intrinsics.h | 78 ++--
 .../ppc/include/rte_power_intrinsics.h | 39 +-
 lib/librte_eal/version.map | 5 +
 .../x86/include/rte_power_intrinsics.h | 115 ------
 lib/librte_eal/x86/meson.build | 1 +
 lib/librte_eal/x86/rte_power_intrinsics.c | 184 +++++++++
 lib/librte_ethdev/rte_ethdev.c | 28 ++
 lib/librte_ethdev/rte_ethdev.h | 25 ++
 lib/librte_ethdev/rte_ethdev_driver.h | 22 ++
 lib/librte_ethdev/version.map | 3 +
 lib/librte_power/meson.build | 5 +-
 lib/librte_power/rte_power_pmd_mgmt.c | 360 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++
 lib/librte_power/version.map | 5 +
 30 files changed, 1055 insertions(+), 229 deletions(-)
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-12 15:54 ` Ananyev, Konstantin 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov ` (10 subsequent siblings) 11 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara Currently, power intrinsics are inline functions. Make them part of the ABI so that we can have various internal data associated with them without exposing said data to the outside world. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- .../arm/include/rte_power_intrinsics.h | 6 +- .../include/generic/rte_power_intrinsics.h | 6 +- .../ppc/include/rte_power_intrinsics.h | 6 +- lib/librte_eal/version.map | 5 + .../x86/include/rte_power_intrinsics.h | 115 ----------------- lib/librte_eal/x86/meson.build | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 120 ++++++++++++++++++ 7 files changed, 135 insertions(+), 124 deletions(-) create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index a4a1bc1159..5e384d380e 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -16,7 +16,7 @@ extern "C" { /** * This function is not supported on ARM. */ -static inline void +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz) @@ -31,7 +31,7 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /** * This function is not supported on ARM. 
*/ -static inline void +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck) @@ -47,7 +47,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /** * This function is not supported on ARM. */ -static inline void +void rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index dd520d90fa..67977bd511 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -52,7 +52,7 @@ * to undefined result. */ __rte_experimental -static inline void rte_power_monitor(const volatile void *p, +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz); @@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p, * wakes up. */ __rte_experimental -static inline void rte_power_monitor_sync(const volatile void *p, +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck); @@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p, * architecture-dependent. 
*/ __rte_experimental -static inline void rte_power_pause(const uint64_t tsc_timestamp); +void rte_power_pause(const uint64_t tsc_timestamp); #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h index 4ed03d521f..4cb5560c02 100644 --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h @@ -16,7 +16,7 @@ extern "C" { /** * This function is not supported on PPC64. */ -static inline void +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz) @@ -31,7 +31,7 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /** * This function is not supported on PPC64. */ -static inline void +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck) @@ -47,7 +47,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /** * This function is not supported on PPC64. 
*/ -static inline void +void rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 354c068f31..31bf76ae81 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -403,6 +403,11 @@ EXPERIMENTAL { rte_service_lcore_may_be_active; rte_vect_get_max_simd_bitwidth; rte_vect_set_max_simd_bitwidth; + + # added in 21.02 + rte_power_monitor; + rte_power_monitor_sync; + rte_power_pause; }; INTERNAL { diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h index c7d790c854..e4c2b87f73 100644 --- a/lib/librte_eal/x86/include/rte_power_intrinsics.h +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h @@ -13,121 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -static inline uint64_t -__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz) -{ - switch (sz) { - case sizeof(uint8_t): - return *(const volatile uint8_t *)p; - case sizeof(uint16_t): - return *(const volatile uint16_t *)p; - case sizeof(uint32_t): - return *(const volatile uint32_t *)p; - case sizeof(uint64_t): - return *(const volatile uint64_t *)p; - default: - /* this is an intrinsic, so we can't have any error handling */ - RTE_ASSERT(0); - return 0; - } -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. 
- */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); -} - -/** - * This function uses TPAUSE instruction and will enter C0.2 state. For more - * information about usage of this instruction, please refer to Intel(R) 64 and - * IA-32 Architectures Software Developer's Manual. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* execute TPAUSE */ - asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build index e78f29002e..dfd42dee0c 100644 --- a/lib/librte_eal/x86/meson.build +++ b/lib/librte_eal/x86/meson.build @@ -8,4 +8,5 @@ sources += files( 'rte_cycles.c', 'rte_hypervisor.c', 'rte_spinlock.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c new file mode 100644 index 0000000000..34c5fd9c3e --- /dev/null +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -0,0 +1,120 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +static inline uint64_t +__get_umwait_val(const volatile void *p, const uint8_t sz) +{ + switch (sz) { + case sizeof(uint8_t): + return *(const volatile uint8_t *)p; + case sizeof(uint16_t): + return *(const volatile uint16_t *)p; + case sizeof(uint32_t): + return *(const volatile uint32_t *)p; + case sizeof(uint64_t): + return *(const volatile uint64_t *)p; + default: + /* this is an intrinsic, so we can't have any error handling */ + RTE_ASSERT(0); + return 0; + } +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
+ */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. + */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. 
+ */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + rte_spinlock_unlock(lck); + + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); + + rte_spinlock_lock(lck); +} + +/** + * This function uses TPAUSE instruction and will enter C0.2 state. For more + * information about usage of this instruction, please refer to Intel(R) 64 and + * IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* execute TPAUSE */ + asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
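Note on the helpers moved in 01/11: two small pieces of logic travel from the header into rte_power_intrinsics.c unchanged. One is the size-dispatched read of the monitored address; the other is the split of the 64-bit TSC deadline into the low and high halves that are passed in EAX and EDX. The following is only a portable sketch of those two steps, with no inline assembly. The names sketch_read_sized() and sketch_split_tsc() are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for __get_umwait_val() from the patch: read 1, 2, 4
 * or 8 bytes from p and widen the result to 64 bits.
 */
static uint64_t
sketch_read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		/* the patch uses RTE_ASSERT(0) here; plain assert() in this sketch */
		assert(0);
		return 0;
	}
}

/*
 * The patch passes the deadline as "a"(tsc_l), "d"(tsc_h), so the 64-bit
 * timestamp is split into two 32-bit halves before the instruction runs.
 */
static void
sketch_split_tsc(uint64_t tsc, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)tsc;
	*hi = (uint32_t)(tsc >> 32);
}
```

Any size other than 1, 2, 4 or 8 hits the default branch, which matches the "undefined result" wording in the API documentation.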
* Re: [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov @ 2021-01-12 15:54 ` Ananyev, Konstantin 0 siblings, 0 replies; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-12 15:54 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, McDaniel, Timothy, Hunt, David, Macnamara, Chris > -----Original Message----- > From: Burakov, Anatoly <anatoly.burakov@intel.com> > Sent: Friday, January 8, 2021 5:42 PM > To: dev@dpdk.org > Cc: Jan Viktorin <viktorin@rehivetech.com>; Ruifeng Wang <ruifeng.wang@arm.com>; Jerin Jacob <jerinj@marvell.com>; David > Christensen <drc@linux.vnet.ibm.com>; Ray Kinsella <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>; Richardson, Bruce > <bruce.richardson@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; thomas@monjalon.net; gage.eads@intel.com; > McDaniel, Timothy <timothy.mcdaniel@intel.com>; Hunt, David <david.hunt@intel.com>; Macnamara, Chris > <chris.macnamara@intel.com> > Subject: [PATCH v13 01/11] eal: uninline power intrinsics > > Currently, power intrinsics are inline functions. Make them part of the > ABI so that we can have various internal data associated with them > without exposing said data to the outside world. > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > --- Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > -- > 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
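Note on rte_power_monitor_sync() as it appears in 01/11: the caller enters with the spinlock held, the lock is dropped just before UMWAIT, and it is taken again once the core wakes up. A rough, portable sketch of that ordering follows. It assumes a toy C11 spinlock in place of rte_spinlock_t and a callback in place of the UMWAIT instruction; every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock standing in for rte_spinlock_t (an assumption of this sketch). */
struct sketch_lock {
	atomic_bool held;
};

static void
sketch_lock_acquire(struct sketch_lock *l)
{
	/* spin until we are the one who flipped held from false to true */
	while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
		;
}

static void
sketch_lock_release(struct sketch_lock *l)
{
	atomic_store_explicit(&l->held, false, memory_order_release);
}

/*
 * Same ordering as the patch: unlock, wait, lock again. wait_fn is a
 * placeholder for the UMWAIT instruction.
 */
static void
sketch_wait_unlocked(struct sketch_lock *l, void (*wait_fn)(void *), void *arg)
{
	/* the API documentation requires the lock to be held on entry */
	assert(atomic_load(&l->held));

	sketch_lock_release(l);
	wait_fn(arg);
	sketch_lock_acquire(l);
}

/* Example wait body: record whether the lock was free while "sleeping". */
struct sketch_probe {
	struct sketch_lock *lock;
	bool was_free;
};

static void
sketch_probe_wait(void *arg)
{
	struct sketch_probe *p = arg;

	p->was_free = !atomic_load(&p->lock->held);
}
```

The point of the ordering is that other threads can take the lock while this core sleeps, and that the caller sees the lock held again on return.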
* [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in power intrinsics 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-08 19:58 ` Stephen Hemminger 2021-01-12 15:56 ` Ananyev, Konstantin 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 03/11] eal: change API of " Anatoly Burakov ` (9 subsequent siblings) 11 siblings, 2 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Bruce Richardson, Konstantin Ananyev, thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the API documentation mandates that if the user wants to use the power management intrinsics, they need to call the `rte_cpu_get_intrinsics_support` API and check support for specific intrinsics. However, if the user does not do that, it is possible to get illegal instruction error because we're using raw instruction opcodes, which may or may not be supported at runtime. Now that we have everything in a C file, we can check for support at startup and prevent the user from possibly encountering illegal instruction errors. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- .../include/generic/rte_power_intrinsics.h | 3 -- lib/librte_eal/x86/rte_power_intrinsics.c | 31 +++++++++++++++++-- 2 files changed, 28 insertions(+), 6 deletions(-) diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 67977bd511..ffa72f7578 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -34,7 +34,6 @@ * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. 
* * @param p * Address to monitor for changes. @@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. @@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 34c5fd9c3e..b48a54ec7f 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -4,6 +4,8 @@ #include "rte_power_intrinsics.h" +static uint8_t wait_supported; + static inline uint64_t __get_umwait_val(const volatile void *p, const uint8_t sz) { @@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. 
@@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. @@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp) const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* execute TPAUSE */ asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} + +RTE_INIT(rte_power_intrinsics_init) { + struct rte_cpu_intrinsics i; + + rte_cpu_get_intrinsics_support(&i); + + if (i.power_monitor && i.power_pause) + wait_supported = 1; } -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
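Note on the guard added in 02/11: the support flag is filled in once by a constructor (RTE_INIT) and every intrinsic returns early when it is not set, so the raw opcodes are never executed on a CPU that lacks them. Below is a minimal sketch of that pattern under stated assumptions. sketch_probe_cpu() is a hypothetical stub in place of rte_cpu_get_intrinsics_support(), the GCC/clang constructor attribute is used directly instead of RTE_INIT, and a counter replaces the TPAUSE instruction. The flag is a bool here purely as a choice of the sketch; the patch uses uint8_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_cpu_features {
	bool power_monitor;
	bool power_pause;
};

/* Hypothetical stub: pretend the CPU reports no support for either feature. */
static void
sketch_probe_cpu(struct sketch_cpu_features *f)
{
	f->power_monitor = false;
	f->power_pause = false;
}

static bool wait_supported;
static unsigned int waits_issued; /* counts how often the "instruction" ran */

/* Runs before main(), like a function registered with RTE_INIT in the patch. */
__attribute__((constructor))
static void
sketch_power_init(void)
{
	struct sketch_cpu_features f;

	/* static storage starts out as false, so nothing is enabled yet */
	assert(!wait_supported);

	sketch_probe_cpu(&f);
	if (f.power_monitor && f.power_pause)
		wait_supported = true;
}

static void
sketch_power_pause(uint64_t tsc_timestamp)
{
	(void)tsc_timestamp;

	/* same early return as the patch: do nothing if unsupported */
	if (!wait_supported)
		return;

	waits_issued++; /* the real code executes TPAUSE here */
}
```

With this shape an unsupported platform silently gets a no-op, which is the behaviour the patch chooses over an illegal instruction fault.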
* Re: [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in power intrinsics 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov @ 2021-01-08 19:58 ` Stephen Hemminger 2021-01-11 10:21 ` Burakov, Anatoly 2021-01-12 15:56 ` Ananyev, Konstantin 1 sibling, 1 reply; 421+ messages in thread From: Stephen Hemminger @ 2021-01-08 19:58 UTC (permalink / raw) To: Anatoly Burakov Cc: dev, Bruce Richardson, Konstantin Ananyev, thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara On Fri, 8 Jan 2021 17:42:05 +0000 Anatoly Burakov <anatoly.burakov@intel.com> wrote: > +static uint8_t wait_supported; Since it is being used as a flag, bool is more common usage. ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in power intrinsics 2021-01-08 19:58 ` Stephen Hemminger @ 2021-01-11 10:21 ` Burakov, Anatoly 0 siblings, 0 replies; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-11 10:21 UTC (permalink / raw) To: Stephen Hemminger Cc: dev, Bruce Richardson, Konstantin Ananyev, thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara On 08-Jan-21 7:58 PM, Stephen Hemminger wrote: > On Fri, 8 Jan 2021 17:42:05 +0000 > Anatoly Burakov <anatoly.burakov@intel.com> wrote: > >> +static uint8_t wait_supported; > > Since it is being used as a flag, bool is more common usage. > Will fix in v14, thanks! -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in power intrinsics 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov 2021-01-08 19:58 ` Stephen Hemminger @ 2021-01-12 15:56 ` Ananyev, Konstantin 1 sibling, 0 replies; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-12 15:56 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: Richardson, Bruce, thomas, gage.eads, McDaniel, Timothy, Hunt, David, Macnamara, Chris > Currently, the API documentation mandates that if the user wants to use > the power management intrinsics, they need to call the > `rte_cpu_get_intrinsics_support` API and check support for specific > intrinsics. > > However, if the user does not do that, it is possible to get illegal > instruction error because we're using raw instruction opcodes, which may > or may not be supported at runtime. > > Now that we have everything in a C file, we can check for support at > startup and prevent the user from possibly encountering illegal > instruction errors. > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > --- > .../include/generic/rte_power_intrinsics.h | 3 -- > lib/librte_eal/x86/rte_power_intrinsics.c | 31 +++++++++++++++++-- > 2 files changed, 28 insertions(+), 6 deletions(-) > > diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h > index 67977bd511..ffa72f7578 100644 > --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h > +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h > @@ -34,7 +34,6 @@ > * > * @warning It is responsibility of the user to check if this function is > * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. > - * Failing to do so may result in an illegal CPU instruction error. > * > * @param p > * Address to monitor for changes. 
> @@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p, > * > * @warning It is responsibility of the user to check if this function is > * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. > - * Failing to do so may result in an illegal CPU instruction error. > * > * @param p > * Address to monitor for changes. > @@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p, > * > * @warning It is responsibility of the user to check if this function is > * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. > - * Failing to do so may result in an illegal CPU instruction error. > * > * @param tsc_timestamp > * Maximum TSC timestamp to wait for. Note that the wait behavior is > diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c > index 34c5fd9c3e..b48a54ec7f 100644 > --- a/lib/librte_eal/x86/rte_power_intrinsics.c > +++ b/lib/librte_eal/x86/rte_power_intrinsics.c > @@ -4,6 +4,8 @@ > > #include "rte_power_intrinsics.h" > > +static uint8_t wait_supported; > + > static inline uint64_t > __get_umwait_val(const volatile void *p, const uint8_t sz) > { > @@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, > { > const uint32_t tsc_l = (uint32_t)tsc_timestamp; > const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); > + > + /* prevent user from running this instruction if it's not supported */ > + if (!wait_supported) > + return; > + > /* > * we're using raw byte codes for now as only the newest compiler > * versions support this instruction natively. 
> @@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, > { > const uint32_t tsc_l = (uint32_t)tsc_timestamp; > const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); > + > + /* prevent user from running this instruction if it's not supported */ > + if (!wait_supported) > + return; > + > /* > * we're using raw byte codes for now as only the newest compiler > * versions support this instruction natively. > @@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp) > const uint32_t tsc_l = (uint32_t)tsc_timestamp; > const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); > > + /* prevent user from running this instruction if it's not supported */ > + if (!wait_supported) > + return; > + > /* execute TPAUSE */ > asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" > - : /* ignore rflags */ > - : "D"(0), /* enter C0.2 */ > - "a"(tsc_l), "d"(tsc_h)); > + : /* ignore rflags */ > + : "D"(0), /* enter C0.2 */ > + "a"(tsc_l), "d"(tsc_h)); > +} > + > +RTE_INIT(rte_power_intrinsics_init) { > + struct rte_cpu_intrinsics i; > + > + rte_cpu_get_intrinsics_support(&i); > + > + if (i.power_monitor && i.power_pause) > + wait_supported = 1; > } > -- Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v13 03/11] eal: change API of power intrinsics 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 01/11] eal: uninline power intrinsics Anatoly Burakov 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 02/11] eal: avoid invalid API usage in " Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-12 15:58 ` Ananyev, Konstantin 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor Anatoly Burakov ` (8 subsequent siblings) 11 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, thomas, gage.eads, david.hunt, chris.macnamara Instead of passing around pointers and integers, collect everything into struct. This makes API design around these intrinsics much easier. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- drivers/event/dlb/dlb.c | 10 ++-- drivers/event/dlb2/dlb2.c | 10 ++-- .../arm/include/rte_power_intrinsics.h | 20 +++----- .../include/generic/rte_power_intrinsics.h | 49 ++++++++----------- .../ppc/include/rte_power_intrinsics.h | 20 +++----- lib/librte_eal/x86/rte_power_intrinsics.c | 32 ++++++------ 6 files changed, 62 insertions(+), 79 deletions(-) diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c index 0c95c4793d..d2f2026291 100644 --- a/drivers/event/dlb/dlb.c +++ b/drivers/event/dlb/dlb.c @@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, /* Interrupts not supported by PF PMD */ return 1; } else if (dlb->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, else expected_value = 0; - rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + 
pmc.addr = monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c index 86724863f2..c9a8a02278 100644 --- a/drivers/event/dlb2/dlb2.c +++ b/drivers/event/dlb2/dlb2.c @@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, if (elapsed_ticks >= timeout) { return 1; } else if (dlb2->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb2_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, else expected_value = 0; - rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index 5e384d380e..76a5fa5234 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -17,31 +17,23 @@ extern "C" { * This function is not supported on ARM. */ void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); } /** * This function is not supported on ARM. 
*/ void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); } /** diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index ffa72f7578..00c670cb50 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -18,6 +18,18 @@ * which are architecture-dependent. */ +struct rte_power_monitor_cond { + volatile void *addr; /**< Address to monitor for changes */ + uint64_t val; /**< Before attempting the monitoring, the address + * may be read and compared against this value. + **/ + uint64_t mask; /**< 64-bit mask to extract current value from addr */ + uint8_t data_sz; /**< Data size (in bytes) that will be used to compare + * expected value with the memory address. Can be 1, + * 2, 4, or 8. Supplying any other value will lead to + * undefined result. */ +}; + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice @@ -35,25 +47,15 @@ * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. - * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. 
* @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. */ __rte_experimental -void rte_power_monitor(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz); +void rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp); /** * @warning @@ -75,30 +77,19 @@ void rte_power_monitor(const volatile void *p, * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. - * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. * @param lck * A spinlock that must be locked before entering the function, will be * unlocked while the CPU is sleeping, and will be locked again once the CPU * wakes up. 
*/ __rte_experimental -void rte_power_monitor_sync(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz, - rte_spinlock_t *lck); +void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck); /** * @warning diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h index 4cb5560c02..cff0996770 100644 --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h @@ -17,31 +17,23 @@ extern "C" { * This function is not supported on PPC64. */ void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); } /** * This function is not supported on PPC64. 
*/ void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); } /** diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index b48a54ec7f..3e224f5ac7 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -31,9 +31,8 @@ __get_umwait_val(const volatile void *p, const uint8_t sz) * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -50,14 +49,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return; } /* execute UMWAIT */ @@ -73,9 +73,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * Intel(R) 64 and IA-32 Architectures Software 
Developer's Manual. */ void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -92,14 +91,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return; } rte_spinlock_unlock(lck); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
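For readers following the API change, the early-exit check that the patch now drives from `pmc` can be sketched in isolation. This is only an illustration: `struct monitor_cond`, `read_sized()` and `cond_already_met()` are hypothetical names that mirror the fields the diff reads through `pmc->` (`addr`, `val`, `mask`, `data_sz`). The real structure is defined in rte_power_intrinsics.h and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fields the patch accesses via pmc->... */
struct monitor_cond {
	const volatile void *addr; /* address to monitor */
	uint64_t val;              /* expected value after masking */
	uint64_t mask;             /* 0 disables the check */
	uint8_t data_sz;           /* 1, 2, 4 or 8 bytes */
};

/* Read sz bytes from p and widen the result to 64 bits. */
static uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		/* assumption of this sketch only: other sizes read as 0 */
		return 0;
	}
}

/* Return 1 if the masked value already matches, i.e. sleep is skipped. */
static int
cond_already_met(const struct monitor_cond *pmc)
{
	assert(pmc != NULL && pmc->addr != NULL);

	if (pmc->mask == 0)
		return 0; /* check disabled, always try to sleep */

	return (read_sized(pmc->addr, pmc->data_sz) & pmc->mask) == pmc->val;
}
```

A caller would fill in the structure once and hand the same pointer to every helper, which is the simplification the commit message describes.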
* Re: [dpdk-dev] [PATCH v13 03/11] eal: change API of power intrinsics
  2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-12 15:58 ` Ananyev, Konstantin
From: Ananyev, Konstantin @ 2021-01-12 15:58 UTC
To: Burakov, Anatoly, dev
Cc: McDaniel, Timothy, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Richardson, Bruce, thomas, gage.eads, Hunt, David, Macnamara, Chris

> Instead of passing around pointers and integers, collect everything
> into struct. This makes API design around these intrinsics much easier.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> --
> 2.25.1
* [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (2 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 03/11] eal: change API of " Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-12 15:59 ` Ananyev, Konstantin 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function Anatoly Burakov ` (7 subsequent siblings) 11 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the "sync" version of power monitor intrinsic is supposed to be used for purposes of waking up a sleeping core. However, there are better ways to achieve the same result, so remove the unneeded function. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- .../arm/include/rte_power_intrinsics.h | 12 ----- .../include/generic/rte_power_intrinsics.h | 34 -------------- .../ppc/include/rte_power_intrinsics.h | 12 ----- lib/librte_eal/version.map | 1 - lib/librte_eal/x86/rte_power_intrinsics.c | 46 ------------------- 5 files changed, 105 deletions(-) diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index 76a5fa5234..27869251a8 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -24,18 +24,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, RTE_SET_USED(tsc_timestamp); } -/** - * This function is not supported on ARM. 
- */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); -} - /** * This function is not supported on ARM. */ diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 00c670cb50..a6f1955996 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -57,40 +57,6 @@ __rte_experimental void rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint64_t tsc_timestamp); -/** - * @warning - * @b EXPERIMENTAL: this API may change without prior notice - * - * Monitor specific address for changes. This will cause the CPU to enter an - * architecture-defined optimized power state until either the specified - * memory address is written to, a certain TSC timestamp is reached, or other - * reasons cause the CPU to wake up. - * - * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If - * mask is non-zero, the current value pointed to by the `p` pointer will be - * checked against the expected value, and if they match, the entering of - * optimized power state may be aborted. - * - * This call will also lock a spinlock on entering sleep, and release it on - * waking up the CPU. - * - * @warning It is responsibility of the user to check if this function is - * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * - * @param pmc - * The monitoring condition structure. - * @param tsc_timestamp - * Maximum TSC timestamp to wait for. Note that the wait behavior is - * architecture-dependent. - * @param lck - * A spinlock that must be locked before entering the function, will be - * unlocked while the CPU is sleeping, and will be locked again once the CPU - * wakes up. 
- */ -__rte_experimental -void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck); - /** * @warning * @b EXPERIMENTAL: this API may change without prior notice diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h index cff0996770..248d1f4a23 100644 --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h @@ -24,18 +24,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, RTE_SET_USED(tsc_timestamp); } -/** - * This function is not supported on PPC64. - */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); -} - /** * This function is not supported on PPC64. */ diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 31bf76ae81..20945b1efa 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -406,7 +406,6 @@ EXPERIMENTAL { # added in 21.02 rte_power_monitor; - rte_power_monitor_sync; rte_power_pause; }; diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 3e224f5ac7..a9cd1afe9d 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -67,52 +67,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, "a"(tsc_l), "d"(tsc_h)); } -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
- */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* prevent user from running this instruction if it's not supported */ - if (!wait_supported) - return; - - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(pmc->addr)); - - if (pmc->mask) { - const uint64_t cur_value = __get_umwait_val( - pmc->addr, pmc->data_sz); - const uint64_t masked = cur_value & pmc->mask; - - /* if the masked value is already matching, abort */ - if (masked == pmc->val) - return; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); -} - /** * This function uses TPAUSE instruction and will enter C0.2 state. For more * information about usage of this instruction, please refer to Intel(R) 64 and -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor
  2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor Anatoly Burakov
@ 2021-01-12 15:59 ` Ananyev, Konstantin
From: Ananyev, Konstantin @ 2021-01-12 15:59 UTC
To: Burakov, Anatoly, dev
Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, gage.eads, McDaniel, Timothy, Hunt, David, Macnamara, Chris

> Currently, the "sync" version of power monitor intrinsic is supposed to
> be used for purposes of waking up a sleeping core. However, there are
> better ways to achieve the same result, so remove the unneeded function.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> --
> 2.25.1
* [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (3 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 04/11] eal: remove sync version of power monitor Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-12 16:02 ` Ananyev, Konstantin 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API Anatoly Burakov ` (6 subsequent siblings) 11 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, gage.eads, timothy.mcdaniel, david.hunt, chris.macnamara Now that we have everything in a C file, we can store the information about our sleep, and have a native mechanism to wake up the sleeping core. This mechanism would however only wake up a core that's sleeping while monitoring - waking up from `rte_power_pause` won't work. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v13: - Add comments around wakeup code to explain what it does - Add lcore_id parameter checking to prevent buffer overrun .../arm/include/rte_power_intrinsics.h | 9 ++ .../include/generic/rte_power_intrinsics.h | 16 ++++ .../ppc/include/rte_power_intrinsics.h | 9 ++ lib/librte_eal/version.map | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 85 +++++++++++++++++++ 5 files changed, 120 insertions(+) diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index 27869251a8..39e49cc45b 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp) RTE_SET_USED(tsc_timestamp); } +/** + * This function is not supported on ARM. 
+ */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} + #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index a6f1955996..e311d6f8ea 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -57,6 +57,22 @@ __rte_experimental void rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint64_t tsc_timestamp); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Wake up a specific lcore that is in a power optimized state and is monitoring + * an address. + * + * @note This function will *not* wake up a core that is in a power optimized + * state due to calling `rte_power_pause`. + * + * @param lcore_id + * Lcore ID of a sleeping thread. + */ +__rte_experimental +void rte_power_monitor_wakeup(const unsigned int lcore_id); + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h index 248d1f4a23..2e7db0e7eb 100644 --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp) RTE_SET_USED(tsc_timestamp); } +/** + * This function is not supported on PPC64. 
+ */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} + #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 20945b1efa..ac026e289d 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -406,6 +406,7 @@ EXPERIMENTAL { # added in 21.02 rte_power_monitor; + rte_power_monitor_wakeup; rte_power_pause; }; diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index a9cd1afe9d..46a4fb6cd5 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -2,8 +2,31 @@ * Copyright(c) 2020 Intel Corporation */ +#include <rte_common.h> +#include <rte_lcore.h> +#include <rte_spinlock.h> + #include "rte_power_intrinsics.h" +/* + * Per-lcore structure holding current status of C0.2 sleeps. + */ +static struct power_wait_status { + rte_spinlock_t lock; + volatile void *monitor_addr; /**< NULL if not currently sleeping */ +} __rte_cache_aligned wait_status[RTE_MAX_LCORE]; + +static inline void +__umwait_wakeup(volatile void *addr) +{ + uint64_t val; + + /* trigger a write but don't change the value */ + val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED); + __atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0, + __ATOMIC_RELAXED, __ATOMIC_RELAXED); +} + static uint8_t wait_supported; static inline uint64_t @@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + const unsigned int lcore_id = rte_lcore_id(); + struct power_wait_status *s; + + /* prevent non-EAL thread from using this API */ + if (lcore_id >= RTE_MAX_LCORE) + return; /* prevent user from running this instruction if it's not supported */ if (!wait_supported) @@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, if (masked == pmc->val) 
return; } + + s = &wait_status[lcore_id]; + + /* update sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = pmc->addr; + rte_spinlock_unlock(&s->lock); + /* execute UMWAIT */ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); + + /* erase sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = NULL; + rte_spinlock_unlock(&s->lock); } /** @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) { if (i.power_monitor && i.power_pause) wait_supported = 1; } + +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + struct power_wait_status *s; + + /* prevent buffer overrun */ + if (lcore_id >= RTE_MAX_LCORE) + return; + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + + s = &wait_status[lcore_id]; + + /* + * There is a race condition between sleep, wakeup and locking, but we + * don't need to handle it. + * + * Possible situations: + * + * 1. T1 locks, sets address, unlocks + * 2. T2 locks, triggers wakeup, unlocks + * 3. T1 sleeps + * + * In this case, because T1 has already set the address for monitoring, + * we will wake up immediately even if T2 triggers wakeup before T1 + * goes to sleep. + * + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up + * 2. T2 locks, triggers wakeup, and unlocks + * 3. T1 locks, erases address, and unlocks + * + * In this case, since we've already woken up, the "wakeup" was + * unneeded, and since T1 is still waiting on T2 releasing the lock, the + * wakeup address is still valid so it's perfectly safe to write it. + */ + rte_spinlock_lock(&s->lock); + if (s->monitor_addr != NULL) + __umwait_wakeup(s->monitor_addr); + rte_spinlock_unlock(&s->lock); +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
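The wakeup trick in this patch, a write that leaves the monitored value unchanged, can be tried outside DPDK. The sketch below is a single-threaded illustration only: `monitor_addr` stands in for one slot of the patch's per-lcore `wait_status[]`, the spinlock is omitted, and `noop_write()` and `wakeup()` are hypothetical names. It relies on the GCC/Clang `__atomic` builtins, as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical single-slot stand-in for wait_status[lcore_id].monitor_addr.
 * NULL means "nobody is sleeping". No lock here: single-threaded sketch.
 */
static volatile void *monitor_addr;

/*
 * Trigger a store to addr without changing what is stored there:
 * load the current value, then compare-and-swap it with itself.
 */
static void
noop_write(volatile void *addr)
{
	uint64_t val;

	assert(addr != NULL);

	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/* Return 1 if a wakeup write was issued, 0 if there was no sleeper. */
static int
wakeup(void)
{
	if (monitor_addr == NULL)
		return 0;

	noop_write(monitor_addr);
	return 1;
}
```

In the patch the same check-and-write runs under the per-lcore spinlock, which is what keeps the address valid while it is being written.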
* Re: [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function Anatoly Burakov @ 2021-01-12 16:02 ` Ananyev, Konstantin 2021-01-12 16:18 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-12 16:02 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, gage.eads, McDaniel, Timothy, Hunt, David, Macnamara, Chris > > Now that we have everything in a C file, we can store the information > about our sleep, and have a native mechanism to wake up the sleeping > core. This mechanism would however only wake up a core that's sleeping > while monitoring - waking up from `rte_power_pause` won't work. > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > --- > > Notes: > v13: > - Add comments around wakeup code to explain what it does > - Add lcore_id parameter checking to prevent buffer overrun > > .../arm/include/rte_power_intrinsics.h | 9 ++ > .../include/generic/rte_power_intrinsics.h | 16 ++++ > .../ppc/include/rte_power_intrinsics.h | 9 ++ > lib/librte_eal/version.map | 1 + > lib/librte_eal/x86/rte_power_intrinsics.c | 85 +++++++++++++++++++ > 5 files changed, 120 insertions(+) > > diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h > index 27869251a8..39e49cc45b 100644 > --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h > +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h > @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp) > RTE_SET_USED(tsc_timestamp); > } > > +/** > + * This function is not supported on ARM. 
> + */ > +void > +rte_power_monitor_wakeup(const unsigned int lcore_id) > +{ > + RTE_SET_USED(lcore_id); > +} > + > #ifdef __cplusplus > } > #endif > diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h > index a6f1955996..e311d6f8ea 100644 > --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h > +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h > @@ -57,6 +57,22 @@ __rte_experimental > void rte_power_monitor(const struct rte_power_monitor_cond *pmc, > const uint64_t tsc_timestamp); > > +/** > + * @warning > + * @b EXPERIMENTAL: this API may change without prior notice > + * > + * Wake up a specific lcore that is in a power optimized state and is monitoring > + * an address. > + * > + * @note This function will *not* wake up a core that is in a power optimized > + * state due to calling `rte_power_pause`. > + * > + * @param lcore_id > + * Lcore ID of a sleeping thread. > + */ > +__rte_experimental > +void rte_power_monitor_wakeup(const unsigned int lcore_id); > + > /** > * @warning > * @b EXPERIMENTAL: this API may change without prior notice > diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h > index 248d1f4a23..2e7db0e7eb 100644 > --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h > +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h > @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp) > RTE_SET_USED(tsc_timestamp); > } > > +/** > + * This function is not supported on PPC64. 
> + */ > +void > +rte_power_monitor_wakeup(const unsigned int lcore_id) > +{ > + RTE_SET_USED(lcore_id); > +} > + > #ifdef __cplusplus > } > #endif > diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map > index 20945b1efa..ac026e289d 100644 > --- a/lib/librte_eal/version.map > +++ b/lib/librte_eal/version.map > @@ -406,6 +406,7 @@ EXPERIMENTAL { > > # added in 21.02 > rte_power_monitor; > + rte_power_monitor_wakeup; > rte_power_pause; > }; > > diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c > index a9cd1afe9d..46a4fb6cd5 100644 > --- a/lib/librte_eal/x86/rte_power_intrinsics.c > +++ b/lib/librte_eal/x86/rte_power_intrinsics.c > @@ -2,8 +2,31 @@ > * Copyright(c) 2020 Intel Corporation > */ > > +#include <rte_common.h> > +#include <rte_lcore.h> > +#include <rte_spinlock.h> > + > #include "rte_power_intrinsics.h" > > +/* > + * Per-lcore structure holding current status of C0.2 sleeps. > + */ > +static struct power_wait_status { > + rte_spinlock_t lock; > + volatile void *monitor_addr; /**< NULL if not currently sleeping */ > +} __rte_cache_aligned wait_status[RTE_MAX_LCORE]; > + > +static inline void > +__umwait_wakeup(volatile void *addr) > +{ > + uint64_t val; > + > + /* trigger a write but don't change the value */ > + val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED); > + __atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0, > + __ATOMIC_RELAXED, __ATOMIC_RELAXED); > +} > + > static uint8_t wait_supported; > > static inline uint64_t > @@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, > { > const uint32_t tsc_l = (uint32_t)tsc_timestamp; > const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); > + const unsigned int lcore_id = rte_lcore_id(); > + struct power_wait_status *s; > + > + /* prevent non-EAL thread from using this API */ > + if (lcore_id >= RTE_MAX_LCORE) > + return; > > /* prevent user from running this instruction if it's not 
supported */
> if (!wait_supported)
> @@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> if (masked == pmc->val)
> return;
> }
> +
> + s = &wait_status[lcore_id];
> +
> + /* update sleep address */
> + rte_spinlock_lock(&s->lock);
> + s->monitor_addr = pmc->addr;
> + rte_spinlock_unlock(&s->lock);

It's been a while since I last looked at this, but shouldn't we grab the lock before monitor()?
I.e.:
lock();
monitor();
addr=...;
unlock();
umwait();

> +
> /* execute UMWAIT */
> asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
> : /* ignore rflags */
> : "D"(0), /* enter C0.2 */
> "a"(tsc_l), "d"(tsc_h));
> +
> + /* erase sleep address */
> + rte_spinlock_lock(&s->lock);
> + s->monitor_addr = NULL;
> + rte_spinlock_unlock(&s->lock);
> }
>
> /**
> @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) {
> if (i.power_monitor && i.power_pause)
> wait_supported = 1;
> }
> +
> +void
> +rte_power_monitor_wakeup(const unsigned int lcore_id)
> +{
> + struct power_wait_status *s;
> +
> + /* prevent buffer overrun */
> + if (lcore_id >= RTE_MAX_LCORE)
> + return;
> +
> + /* prevent user from running this instruction if it's not supported */
> + if (!wait_supported)
> + return;
> +
> + s = &wait_status[lcore_id];
> +
> + /*
> + * There is a race condition between sleep, wakeup and locking, but we
> + * don't need to handle it.
> + *
> + * Possible situations:
> + *
> + * 1. T1 locks, sets address, unlocks
> + * 2. T2 locks, triggers wakeup, unlocks
> + * 3. T1 sleeps
> + *
> + * In this case, because T1 has already set the address for monitoring,
> + * we will wake up immediately even if T2 triggers wakeup before T1
> + * goes to sleep.
> + *
> + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
> + * 2. T2 locks, triggers wakeup, and unlocks
> + * 3.
T1 locks, erases address, and unlocks > + * > + * In this case, since we've already woken up, the "wakeup" was > + * unneeded, and since T1 is still waiting on T2 releasing the lock, the > + * wakeup address is still valid so it's perfectly safe to write it. > + */ > + rte_spinlock_lock(&s->lock); > + if (s->monitor_addr != NULL) > + __umwait_wakeup(s->monitor_addr); > + rte_spinlock_unlock(&s->lock); > +} > -- > 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function 2021-01-12 16:02 ` Ananyev, Konstantin @ 2021-01-12 16:18 ` Burakov, Anatoly 2021-01-12 16:25 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-12 16:18 UTC (permalink / raw) To: Ananyev, Konstantin, dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, gage.eads, McDaniel, Timothy, Hunt, David, Macnamara, Chris On 12-Jan-21 4:02 PM, Ananyev, Konstantin wrote: > >> >> Now that we have everything in a C file, we can store the information >> about our sleep, and have a native mechanism to wake up the sleeping >> core. This mechanism would however only wake up a core that's sleeping >> while monitoring - waking up from `rte_power_pause` won't work. >> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> >> --- >> >> Notes: >> v13: >> - Add comments around wakeup code to explain what it does >> - Add lcore_id parameter checking to prevent buffer overrun >> >> .../arm/include/rte_power_intrinsics.h | 9 ++ >> .../include/generic/rte_power_intrinsics.h | 16 ++++ >> .../ppc/include/rte_power_intrinsics.h | 9 ++ >> lib/librte_eal/version.map | 1 + >> lib/librte_eal/x86/rte_power_intrinsics.c | 85 +++++++++++++++++++ >> 5 files changed, 120 insertions(+) >> >> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h >> index 27869251a8..39e49cc45b 100644 >> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h >> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h >> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp) >> RTE_SET_USED(tsc_timestamp); >> } >> >> +/** >> + * This function is not supported on ARM. 
>> + */ >> +void >> +rte_power_monitor_wakeup(const unsigned int lcore_id) >> +{ >> +RTE_SET_USED(lcore_id); >> +} >> + >> #ifdef __cplusplus >> } >> #endif >> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h >> index a6f1955996..e311d6f8ea 100644 >> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h >> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h >> @@ -57,6 +57,22 @@ __rte_experimental >> void rte_power_monitor(const struct rte_power_monitor_cond *pmc, >> const uint64_t tsc_timestamp); >> >> +/** >> + * @warning >> + * @b EXPERIMENTAL: this API may change without prior notice >> + * >> + * Wake up a specific lcore that is in a power optimized state and is monitoring >> + * an address. >> + * >> + * @note This function will *not* wake up a core that is in a power optimized >> + * state due to calling `rte_power_pause`. >> + * >> + * @param lcore_id >> + * Lcore ID of a sleeping thread. >> + */ >> +__rte_experimental >> +void rte_power_monitor_wakeup(const unsigned int lcore_id); >> + >> /** >> * @warning >> * @b EXPERIMENTAL: this API may change without prior notice >> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h >> index 248d1f4a23..2e7db0e7eb 100644 >> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h >> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h >> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp) >> RTE_SET_USED(tsc_timestamp); >> } >> >> +/** >> + * This function is not supported on PPC64. 
>> + */ >> +void >> +rte_power_monitor_wakeup(const unsigned int lcore_id) >> +{ >> +RTE_SET_USED(lcore_id); >> +} >> + >> #ifdef __cplusplus >> } >> #endif >> diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map >> index 20945b1efa..ac026e289d 100644 >> --- a/lib/librte_eal/version.map >> +++ b/lib/librte_eal/version.map >> @@ -406,6 +406,7 @@ EXPERIMENTAL { >> >> # added in 21.02 >> rte_power_monitor; >> +rte_power_monitor_wakeup; >> rte_power_pause; >> }; >> >> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c >> index a9cd1afe9d..46a4fb6cd5 100644 >> --- a/lib/librte_eal/x86/rte_power_intrinsics.c >> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c >> @@ -2,8 +2,31 @@ >> * Copyright(c) 2020 Intel Corporation >> */ >> >> +#include <rte_common.h> >> +#include <rte_lcore.h> >> +#include <rte_spinlock.h> >> + >> #include "rte_power_intrinsics.h" >> >> +/* >> + * Per-lcore structure holding current status of C0.2 sleeps. >> + */ >> +static struct power_wait_status { >> +rte_spinlock_t lock; >> +volatile void *monitor_addr; /**< NULL if not currently sleeping */ >> +} __rte_cache_aligned wait_status[RTE_MAX_LCORE]; >> + >> +static inline void >> +__umwait_wakeup(volatile void *addr) >> +{ >> +uint64_t val; >> + >> +/* trigger a write but don't change the value */ >> +val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED); >> +__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0, >> +__ATOMIC_RELAXED, __ATOMIC_RELAXED); >> +} >> + >> static uint8_t wait_supported; >> >> static inline uint64_t >> @@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, >> { >> const uint32_t tsc_l = (uint32_t)tsc_timestamp; >> const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); >> +const unsigned int lcore_id = rte_lcore_id(); >> +struct power_wait_status *s; >> + >> +/* prevent non-EAL thread from using this API */ >> +if (lcore_id >= RTE_MAX_LCORE) >> +return; >> >> /* 
prevent user from running this instruction if it's not supported */
>> if (!wait_supported)
>> @@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>> if (masked == pmc->val)
>> return;
>> }
>> +
>> +s = &wait_status[lcore_id];
>> +
>> +/* update sleep address */
>> +rte_spinlock_lock(&s->lock);
>> +s->monitor_addr = pmc->addr;
>> +rte_spinlock_unlock(&s->lock);
>
> It's been a while since I last looked at this,
> but shouldn't we grab the lock before monitor()?
> I.e.:
> lock();
> monitor();
> addr=...;
> unlock();
> umwait();
>

I don't believe so. The idea here is to only store the address when we are looking to sleep, and avoid the locks entirely if we already know we aren't going to sleep. I mean, technically we could lock unconditionally, then unlock when we're done, but there's very little practical difference between the two because the moment we are interested in (sleep) happens the same way whether we lock before or after monitor().

>> +
>> /* execute UMWAIT */
>> asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
>> : /* ignore rflags */
>> : "D"(0), /* enter C0.2 */
>> "a"(tsc_l), "d"(tsc_h));
>> +
>> +/* erase sleep address */
>> +rte_spinlock_lock(&s->lock);
>> +s->monitor_addr = NULL;
>> +rte_spinlock_unlock(&s->lock);
>> }
>>
>> /**
>> @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) {
>> if (i.power_monitor && i.power_pause)
>> wait_supported = 1;
>> }
>> +
>> +void
>> +rte_power_monitor_wakeup(const unsigned int lcore_id)
>> +{
>> +struct power_wait_status *s;
>> +
>> +/* prevent buffer overrun */
>> +if (lcore_id >= RTE_MAX_LCORE)
>> +return;
>> +
>> +/* prevent user from running this instruction if it's not supported */
>> +if (!wait_supported)
>> +return;
>> +
>> +s = &wait_status[lcore_id];
>> +
>> +/*
>> + * There is a race condition between sleep, wakeup and locking, but we
>> + * don't need to handle it.
>> + *
>> + * Possible situations:
>> + *
>> + * 1. T1 locks, sets address, unlocks
>> + * 2.
T2 locks, triggers wakeup, unlocks >> + * 3. T1 sleeps >> + * >> + * In this case, because T1 has already set the address for monitoring, >> + * we will wake up immediately even if T2 triggers wakeup before T1 >> + * goes to sleep. >> + * >> + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up >> + * 2. T2 locks, triggers wakeup, and unlocks >> + * 3. T1 locks, erases address, and unlocks >> + * >> + * In this case, since we've already woken up, the "wakeup" was >> + * unneeded, and since T1 is still waiting on T2 releasing the lock, the >> + * wakeup address is still valid so it's perfectly safe to write it. >> + */ >> +rte_spinlock_lock(&s->lock); >> +if (s->monitor_addr != NULL) >> +__umwait_wakeup(s->monitor_addr); >> +rte_spinlock_unlock(&s->lock); >> +} >> -- >> 2.25.1 -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
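For readers following the wakeup path, here is a minimal sketch of the "write without changing the value" idea used by __umwait_wakeup() in the patch above. It assumes GCC/Clang __atomic builtins; the sketch_* names are hypothetical and are not part of DPDK, and the per-lcore spinlock from the patch is deliberately left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-lcore wait status in the patch. */
struct sketch_wait_status {
	volatile void *monitor_addr; /* NULL if not currently sleeping */
};

/*
 * Load the current value and compare-and-swap it with itself. This issues
 * a store to the monitored location (the event a monitoring core waits
 * for) while leaving the visible contents unchanged.
 * Returns 1 if the exchange took place, 0 if the value changed under us.
 */
static int
sketch_touch(volatile void *addr)
{
	uint64_t val;

	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
	return __atomic_compare_exchange_n((volatile uint64_t *)addr,
			&val, val, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/*
 * Returns 1 if a wakeup write was issued, 0 if the target had no
 * published address (i.e. it was not sleeping).
 */
static int
sketch_wakeup(struct sketch_wait_status *s)
{
	assert(s != NULL);

	if (s->monitor_addr == NULL)
		return 0;
	return sketch_touch(s->monitor_addr);
}
```

In both interleavings listed in the patch comment, the write is harmless: either it lands on an armed address and ends the sleep, or it lands on an address that is still valid and leaves its contents untouched.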
* Re: [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function 2021-01-12 16:18 ` Burakov, Anatoly @ 2021-01-12 16:25 ` Burakov, Anatoly 0 siblings, 0 replies; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-12 16:25 UTC (permalink / raw) To: Ananyev, Konstantin, dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, gage.eads, McDaniel, Timothy, Hunt, David, Macnamara, Chris On 12-Jan-21 4:18 PM, Burakov, Anatoly wrote: > On 12-Jan-21 4:02 PM, Ananyev, Konstantin wrote: >> >>> >>> Now that we have everything in a C file, we can store the information >>> about our sleep, and have a native mechanism to wake up the sleeping >>> core. This mechanism would however only wake up a core that's sleeping >>> while monitoring - waking up from `rte_power_pause` won't work. >>> >>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> >>> --- >>> >>> Notes: >>> v13: >>> - Add comments around wakeup code to explain what it does >>> - Add lcore_id parameter checking to prevent buffer overrun >>> >>> .../arm/include/rte_power_intrinsics.h | 9 ++ >>> .../include/generic/rte_power_intrinsics.h | 16 ++++ >>> .../ppc/include/rte_power_intrinsics.h | 9 ++ >>> lib/librte_eal/version.map | 1 + >>> lib/librte_eal/x86/rte_power_intrinsics.c | 85 +++++++++++++++++++ >>> 5 files changed, 120 insertions(+) >>> >>> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h >>> b/lib/librte_eal/arm/include/rte_power_intrinsics.h >>> index 27869251a8..39e49cc45b 100644 >>> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h >>> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h >>> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp) >>> RTE_SET_USED(tsc_timestamp); >>> } >>> >>> +/** >>> + * This function is not supported on ARM. 
>>> + */ >>> +void >>> +rte_power_monitor_wakeup(const unsigned int lcore_id) >>> +{ >>> +RTE_SET_USED(lcore_id); >>> +} >>> + >>> #ifdef __cplusplus >>> } >>> #endif >>> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h >>> b/lib/librte_eal/include/generic/rte_power_intrinsics.h >>> index a6f1955996..e311d6f8ea 100644 >>> --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h >>> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h >>> @@ -57,6 +57,22 @@ __rte_experimental >>> void rte_power_monitor(const struct rte_power_monitor_cond *pmc, >>> const uint64_t tsc_timestamp); >>> >>> +/** >>> + * @warning >>> + * @b EXPERIMENTAL: this API may change without prior notice >>> + * >>> + * Wake up a specific lcore that is in a power optimized state and >>> is monitoring >>> + * an address. >>> + * >>> + * @note This function will *not* wake up a core that is in a power >>> optimized >>> + * state due to calling `rte_power_pause`. >>> + * >>> + * @param lcore_id >>> + * Lcore ID of a sleeping thread. >>> + */ >>> +__rte_experimental >>> +void rte_power_monitor_wakeup(const unsigned int lcore_id); >>> + >>> /** >>> * @warning >>> * @b EXPERIMENTAL: this API may change without prior notice >>> diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h >>> b/lib/librte_eal/ppc/include/rte_power_intrinsics.h >>> index 248d1f4a23..2e7db0e7eb 100644 >>> --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h >>> +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h >>> @@ -33,6 +33,15 @@ rte_power_pause(const uint64_t tsc_timestamp) >>> RTE_SET_USED(tsc_timestamp); >>> } >>> >>> +/** >>> + * This function is not supported on PPC64. 
>>> + */ >>> +void >>> +rte_power_monitor_wakeup(const unsigned int lcore_id) >>> +{ >>> +RTE_SET_USED(lcore_id); >>> +} >>> + >>> #ifdef __cplusplus >>> } >>> #endif >>> diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map >>> index 20945b1efa..ac026e289d 100644 >>> --- a/lib/librte_eal/version.map >>> +++ b/lib/librte_eal/version.map >>> @@ -406,6 +406,7 @@ EXPERIMENTAL { >>> >>> # added in 21.02 >>> rte_power_monitor; >>> +rte_power_monitor_wakeup; >>> rte_power_pause; >>> }; >>> >>> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c >>> b/lib/librte_eal/x86/rte_power_intrinsics.c >>> index a9cd1afe9d..46a4fb6cd5 100644 >>> --- a/lib/librte_eal/x86/rte_power_intrinsics.c >>> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c >>> @@ -2,8 +2,31 @@ >>> * Copyright(c) 2020 Intel Corporation >>> */ >>> >>> +#include <rte_common.h> >>> +#include <rte_lcore.h> >>> +#include <rte_spinlock.h> >>> + >>> #include "rte_power_intrinsics.h" >>> >>> +/* >>> + * Per-lcore structure holding current status of C0.2 sleeps. 
>>> + */ >>> +static struct power_wait_status { >>> +rte_spinlock_t lock; >>> +volatile void *monitor_addr; /**< NULL if not currently sleeping */ >>> +} __rte_cache_aligned wait_status[RTE_MAX_LCORE]; >>> + >>> +static inline void >>> +__umwait_wakeup(volatile void *addr) >>> +{ >>> +uint64_t val; >>> + >>> +/* trigger a write but don't change the value */ >>> +val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED); >>> +__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0, >>> +__ATOMIC_RELAXED, __ATOMIC_RELAXED); >>> +} >>> + >>> static uint8_t wait_supported; >>> >>> static inline uint64_t >>> @@ -36,6 +59,12 @@ rte_power_monitor(const struct >>> rte_power_monitor_cond *pmc, >>> { >>> const uint32_t tsc_l = (uint32_t)tsc_timestamp; >>> const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); >>> +const unsigned int lcore_id = rte_lcore_id(); >>> +struct power_wait_status *s; >>> + >>> +/* prevent non-EAL thread from using this API */ >>> +if (lcore_id >= RTE_MAX_LCORE) >>> +return; >>> >>> /* prevent user from running this instruction if it's not supported */ >>> if (!wait_supported) >>> @@ -60,11 +89,24 @@ rte_power_monitor(const struct >>> rte_power_monitor_cond *pmc, >>> if (masked == pmc->val) >>> return; >>> } >>> + >>> +s = &wait_status[lcore_id]; >>> + >>> +/* update sleep address */ >>> +rte_spinlock_lock(&s->lock); >>> +s->monitor_addr = pmc->addr; >>> +rte_spinlock_unlock(&s->lock); >> >> It was a while, since I looked at it last time, >> but shouldn't we grab the lock before monitor()? >> I.E: >> lock(); >> monitor(); >> addr=...; >> unlock(); >> umwait(); >> > > I don't believe so. > > The idea here is to only store the address when we are looking to sleep, > and avoid the locks entirely if we already know we aren't going to > sleep. 
I mean, technically we could lock unconditionally, then unlock > when we're done, but there's very little practical difference between > the two because the moment we are interested in (sleep) happens the same > way whether we lock before or after monitor(). On second thought, putting the lock before monitor() and unlocking afterwards allows us to ask for a wakeup earlier, without necessarily waiting for a sleep. So I think I'll take your suggestion on board anyway, thanks! > >>> + >>> /* execute UMWAIT */ >>> asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" >>> : /* ignore rflags */ >>> : "D"(0), /* enter C0.2 */ >>> "a"(tsc_l), "d"(tsc_h)); >>> + >>> +/* erase sleep address */ >>> +rte_spinlock_lock(&s->lock); >>> +s->monitor_addr = NULL; >>> +rte_spinlock_unlock(&s->lock); >>> } >>> >>> /** >>> @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) { >>> if (i.power_monitor && i.power_pause) >>> wait_supported = 1; >>> } >>> + >>> +void >>> +rte_power_monitor_wakeup(const unsigned int lcore_id) >>> +{ >>> +struct power_wait_status *s; >>> + >>> +/* prevent buffer overrun */ >>> +if (lcore_id >= RTE_MAX_LCORE) >>> +return; >>> + >>> +/* prevent user from running this instruction if it's not supported */ >>> +if (!wait_supported) >>> +return; >>> + >>> +s = &wait_status[lcore_id]; >>> + >>> +/* >>> + * There is a race condition between sleep, wakeup and locking, but we >>> + * don't need to handle it. >>> + * >>> + * Possible situations: >>> + * >>> + * 1. T1 locks, sets address, unlocks >>> + * 2. T2 locks, triggers wakeup, unlocks >>> + * 3. T1 sleeps >>> + * >>> + * In this case, because T1 has already set the address for monitoring, >>> + * we will wake up immediately even if T2 triggers wakeup before T1 >>> + * goes to sleep. >>> + * >>> + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up >>> + * 2. T2 locks, triggers wakeup, and unlocks >>> + * 3.
T1 locks, erases address, and unlocks >>> + * >>> + * In this case, since we've already woken up, the "wakeup" was >>> + * unneeded, and since T1 is still waiting on T2 releasing the lock, >>> the >>> + * wakeup address is still valid so it's perfectly safe to write it. >>> + */ >>> +rte_spinlock_lock(&s->lock); >>> +if (s->monitor_addr != NULL) >>> +__umwait_wakeup(s->monitor_addr); >>> +rte_spinlock_unlock(&s->lock); >>> +} >>> -- >>> 2.25.1 > > -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
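The ordering discussed in this exchange (lock, arm the monitor, publish the address, unlock, then wait) can be sketched roughly as follows. This is only an illustration under stated assumptions: the sketch_* names are hypothetical, a C11 atomic_flag stands in for rte_spinlock_t, an "armed" flag stands in for UMONITOR having executed, and a plain same-value store stands in for the compare-and-swap used in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core state; not the DPDK structure. */
struct sketch_core {
	atomic_flag lock;
	volatile uint64_t *monitor_addr; /* NULL if not sleeping */
	int armed;                       /* stands in for UMONITOR */
};

static void
sketch_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		; /* spin */
}

static void
sketch_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Arm the monitor and publish the address under a single lock. */
static void
sketch_prepare_sleep(struct sketch_core *c, volatile uint64_t *addr)
{
	sketch_lock(&c->lock);
	c->armed = 1;           /* UMONITOR would be executed here */
	c->monitor_addr = addr; /* only now is the address visible */
	sketch_unlock(&c->lock);
	/* UMWAIT would follow here, outside the lock */
}

/* Erase the address once the core is awake again. */
static void
sketch_finish_sleep(struct sketch_core *c)
{
	sketch_lock(&c->lock);
	c->monitor_addr = NULL;
	c->armed = 0;
	sketch_unlock(&c->lock);
}

/* Returns 1 if the wakeup write hit an armed monitor, 0 otherwise. */
static int
sketch_request_wakeup(struct sketch_core *c)
{
	int hit = 0;

	sketch_lock(&c->lock);
	if (c->monitor_addr != NULL) {
		const uint64_t v = *c->monitor_addr;

		*c->monitor_addr = v; /* same-value store */
		hit = c->armed;
	}
	sketch_unlock(&c->lock);
	return hit;
}
```

With this ordering, any waker that observes a non-NULL address also knows the monitor was armed before the address became visible, so a wakeup can be requested as soon as the address is published rather than only once the core is already asleep.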
* [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (4 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 05/11] eal: add monitor wakeup function Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-09 8:04 ` Andrew Rybchenko 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 07/11] power: add PMD power management API and callback Anatoly Burakov ` (5 subsequent siblings) 11 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, Neil Horman, konstantin.ananyev, gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Add a simple API to allow getting the monitor conditions for power-optimized monitoring of the RX queues from the PMD, as well as release notes information. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v13: - Fix typos and issues raised by Andrew doc/guides/rel_notes/release_21_02.rst | 5 +++++ lib/librte_ethdev/rte_ethdev.c | 28 ++++++++++++++++++++++++++ lib/librte_ethdev/rte_ethdev.h | 25 +++++++++++++++++++++++ lib/librte_ethdev/rte_ethdev_driver.h | 22 ++++++++++++++++++++ lib/librte_ethdev/version.map | 3 +++ 5 files changed, 83 insertions(+) diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst index 638f98168b..6de0cb568e 100644 --- a/doc/guides/rel_notes/release_21_02.rst +++ b/doc/guides/rel_notes/release_21_02.rst @@ -55,6 +55,11 @@ New Features Also, make sure to start the actual text at the margin. ======================================================= +* **ethdev: added new API for PMD power management** + + * ``rte_eth_get_monitor_addr()``, to be used in conjunction with + ``rte_power_monitor()`` to enable automatic power management for PMD's. 
+ Removed Items ------------- diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c index 17ddacc78d..e19dbd838b 100644 --- a/lib/librte_ethdev/rte_ethdev.c +++ b/lib/librte_ethdev/rte_ethdev.c @@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode)); } +int +rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id, + struct rte_power_monitor_cond *pmc) +{ + struct rte_eth_dev *dev; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV); + + dev = &rte_eth_devices[port_id]; + + RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP); + + if (queue_id >= dev->data->nb_rx_queues) { + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id); + return -EINVAL; + } + + if (pmc == NULL) { + RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n", + pmc); + return -EINVAL; + } + + return eth_err(port_id, + dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id], + pmc)); +} + int rte_eth_dev_set_mc_addr_list(uint16_t port_id, struct rte_ether_addr *mc_addr_set, diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h index f5f8919186..ca0f91312e 100644 --- a/lib/librte_ethdev/rte_ethdev.h +++ b/lib/librte_ethdev/rte_ethdev.h @@ -157,6 +157,7 @@ extern "C" { #include <rte_common.h> #include <rte_config.h> #include <rte_ether.h> +#include <rte_power_intrinsics.h> #include "rte_ethdev_trace_fp.h" #include "rte_dev_info.h" @@ -4334,6 +4335,30 @@ __rte_experimental int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, struct rte_eth_burst_mode *mode); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Retrieve the monitor condition for a given receive queue. + * + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The Rx queue on the Ethernet device for which information + * will be retrieved. 
+ * @param pmc + * The pointer point to power-optimized monitoring condition structure. + * + * @return + * - 0: Success. + * -ENOTSUP: Operation not supported. + * -EINVAL: Invalid parameters. + * -ENODEV: Invalid port ID. + */ +__rte_experimental +int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id, + struct rte_power_monitor_cond *pmc); + /** * Retrieve device registers and register attributes (number of registers and * register size) diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h index 0eacfd8425..3b3b0ec1a0 100644 --- a/lib/librte_ethdev/rte_ethdev_driver.h +++ b/lib/librte_ethdev/rte_ethdev_driver.h @@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t) (struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction); /**< @internal Unbind peer queue from the current queue. */ +/** + * @internal + * Get address of memory location whose contents will change whenever there is + * new data to be received on an Rx queue. + * + * @param rxq + * Ethdev queue pointer. + * @param pmc + * The pointer to power-optimized monitoring condition structure. + * @return + * Negative errno value on error, 0 on success. + * + * @retval 0 + * Success + * @retval -EINVAL + * Invalid parameters + */ +typedef int (*eth_get_monitor_addr_t)(void *rxq, + struct rte_power_monitor_cond *pmc); + /** * @internal A structure containing the functions exported by an Ethernet driver. */ @@ -917,6 +937,8 @@ struct eth_dev_ops { /**< Set up the connection between the pair of hairpin queues. */ eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind; /**< Disconnect the hairpin queues of a pair from each other. */ + eth_get_monitor_addr_t get_monitor_addr; + /**< Get power monitoring condition for Rx queue. 
*/ }; /** diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map index d3f5410806..a124e1e370 100644 --- a/lib/librte_ethdev/version.map +++ b/lib/librte_ethdev/version.map @@ -240,6 +240,9 @@ EXPERIMENTAL { rte_flow_get_restore_info; rte_flow_tunnel_action_decap_release; rte_flow_tunnel_item_release; + + # added in 21.02 + rte_eth_get_monitor_addr; }; INTERNAL { -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
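To illustrate how a monitor condition returned by a driver might be consumed, here is a small self-contained sketch. It does not use the real DPDK types: struct sketch_monitor_cond only mirrors the addr and val fields referenced in the x86 code earlier in this thread, the mask field is an assumption inferred from the "masked" comparison there, and the descriptor layout and SKETCH_DD_BIT are invented for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of a monitor condition (not the DPDK struct). */
struct sketch_monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* value that means "data already there" */
	uint64_t mask;       /* assumed: 0 disables the value check */
};

/* Hypothetical Rx descriptor with a status word written by the NIC. */
struct sketch_rx_desc {
	volatile uint64_t status;
};

#define SKETCH_DD_BIT 0x1ULL /* invented "descriptor done" bit */

/* What a driver callback might do: point at the next descriptor. */
static int
sketch_get_monitor_addr(struct sketch_rx_desc *ring, uint16_t next,
		struct sketch_monitor_cond *pmc)
{
	if (ring == NULL || pmc == NULL)
		return -EINVAL;

	pmc->addr = &ring[next].status;
	pmc->val = SKETCH_DD_BIT;
	pmc->mask = SKETCH_DD_BIT;
	return 0;
}

/*
 * The check made before sleeping: if the masked value already matches,
 * a packet has arrived and entering the sleep state would be pointless.
 * Returns 1 if the sleep should be skipped, 0 otherwise.
 */
static int
sketch_should_skip_sleep(const struct sketch_monitor_cond *pmc)
{
	if (pmc->mask) {
		const uint64_t cur = *(volatile uint64_t *)pmc->addr;
		const uint64_t masked = cur & pmc->mask;

		if (masked == pmc->val)
			return 1;
	}
	return 0;
}
```

The value check closes the window where the NIC writes the descriptor between the last empty poll and the moment the core arms the monitor.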
* Re: [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API Anatoly Burakov @ 2021-01-09 8:04 ` Andrew Rybchenko 0 siblings, 0 replies; 421+ messages in thread From: Andrew Rybchenko @ 2021-01-09 8:04 UTC (permalink / raw) To: Anatoly Burakov, dev Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Ray Kinsella, Neil Horman, konstantin.ananyev, gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara On 1/8/21 8:42 PM, Anatoly Burakov wrote: > From: Liang Ma <liang.j.ma@intel.com> > > Add a simple API to allow getting the monitor conditions for > power-optimized monitoring of the RX queues from the PMD, as well as > release notes information. RX -> Rx > > Signed-off-by: Liang Ma <liang.j.ma@intel.com> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v13 07/11] power: add PMD power management API and callback 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (5 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 06/11] ethdev: add simple power management API Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 08/11] net/ixgbe: implement power management API Anatoly Burakov ` (4 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no packets are arriving. It is based on counting the number of empty polls and, when the number reaches a certain threshold, entering an architecture-defined optimized power state that waits until either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple queues per device are supported, but they have to be polled on different cores).

This design uses PMD RX callbacks. Three schemes are supported:

1. UMWAIT/UMONITOR: When a certain threshold of empty polls is reached, the core will go into a power-optimized sleep while waiting for the address of the next RX descriptor to be written to.
2. TPAUSE/Pause instruction: This method uses the pause (or TPAUSE, if available) instruction to avoid busy polling.
3. Frequency scaling: Reuse the existing DPDK power library to scale core frequency up/down depending on traffic volume.
Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v13: - Rework the synchronization mechanism to not require locking - Add more parameter checking - Rework n_rx_queues access to not go through internal PMD structures and use public API instead doc/guides/prog_guide/power_man.rst | 44 +++ doc/guides/rel_notes/release_21_02.rst | 10 + lib/librte_power/meson.build | 5 +- lib/librte_power/rte_power_pmd_mgmt.c | 360 +++++++++++++++++++++++++ lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++++ lib/librte_power/version.map | 5 + 6 files changed, 512 insertions(+), 2 deletions(-) create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index 0a3755a901..02280dd689 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -192,6 +192,47 @@ User Cases ---------- The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA. +PMD Power Management API +------------------------ + +Abstract +~~~~~~~~ +Existing power management mechanisms require developers to change application +design or change code to make use of it. The PMD power management API provides a +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering +power saving whenever empty poll count reaches a certain number. + + * Monitor + + This power saving scheme will put the CPU into optimized power state and use + the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX + descriptor address, and wake the CPU up whenever there's new traffic. + + * Pause + + This power saving scheme will avoid busy polling by either entering + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's + not available, use ``rte_pause()``. 
+ + * Frequency scaling + + This power saving scheme will use existing ``librte_power`` library + functionality to scale the core frequency up/down depending on traffic + volume. + + +.. note:: + + Currently, this power management API is limited to mandatory mapping of 1 + queue to 1 core (multiple queues are supported, but they must be polled from + different cores). + +API Overview for PMD Power Management +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +* **Queue Enable**: Enable specific power scheme for certain queue/port/core + +* **Queue Disable**: Disable power scheme for certain queue/port/core + References ---------- @@ -200,3 +241,6 @@ References * The :doc:`../sample_app_ug/vm_power_management` chapter in the :doc:`../sample_app_ug/index` section. + +* The :doc:`../sample_app_ug/rxtx_callbacks` + chapter in the :doc:`../sample_app_ug/index` section. diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst index 6de0cb568e..b34828cad6 100644 --- a/doc/guides/rel_notes/release_21_02.rst +++ b/doc/guides/rel_notes/release_21_02.rst @@ -60,6 +60,16 @@ New Features * ``rte_eth_get_monitor_addr()``, to be used in conjunction with ``rte_power_monitor()`` to enable automatic power management for PMD's. +* **Add PMD power management helper API** + + A new helper API has been added to make using Ethernet PMD power management + easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. 
Three power + management schemes are supported initially: + + * Power saving based on UMWAIT instruction (x86 only) + * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only) + * Power saving based on frequency scaling through the ``librte_power`` library + Removed Items ------------- diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build index 4b4cf1b90b..51a471b669 100644 --- a/lib/librte_power/meson.build +++ b/lib/librte_power/meson.build @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c', 'power_kvm_vm.c', 'guest_channel.c', 'rte_power_empty_poll.c', 'power_pstate_cpufreq.c', + 'rte_power_pmd_mgmt.c', 'power_common.c') -headers = files('rte_power.h','rte_power_empty_poll.h') -deps += ['timer'] +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h') +deps += ['timer' ,'ethdev'] diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c new file mode 100644 index 0000000000..65597d354c --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.c @@ -0,0 +1,360 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <rte_lcore.h> +#include <rte_cycles.h> +#include <rte_cpuflags.h> +#include <rte_malloc.h> +#include <rte_ethdev.h> +#include <rte_power_intrinsics.h> + +#include "rte_power_pmd_mgmt.h" + +#define EMPTYPOLL_MAX 512 + +static struct pmd_conf_data { + struct rte_cpu_intrinsics intrinsics_support; + /**< what do we support? */ + uint64_t tsc_per_us; + /**< pre-calculated tsc diff for 1us */ + uint64_t pause_per_us; + /**< how many rte_pause can we fit in a microsecond? */ +} global_data; + +/** + * Possible power management states of an ethdev port. + */ +enum pmd_mgmt_state { + /** Device power management is disabled. */ + PMD_MGMT_DISABLED = 0, + /** Device power management is enabled. */ + PMD_MGMT_ENABLED, + /** Device power management status is about to change.
*/ + PMD_MGMT_BUSY +}; + +struct pmd_queue_cfg { + volatile enum pmd_mgmt_state pwr_mgmt_state; + /**< State of power management for this queue */ + enum rte_power_pmd_mgmt_type cb_mode; + /**< Callback mode for this queue */ + const struct rte_eth_rxtx_callback *cur_cb; + /**< Callback instance */ + volatile bool umwait_in_progress; + /**< are we currently sleeping? */ + uint64_t empty_poll_stats; + /**< Number of empty polls */ +} __rte_cache_aligned; + +static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT]; + +static void +calc_tsc(void) +{ + const uint64_t hz = rte_get_timer_hz(); + const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */ + + global_data.tsc_per_us = tsc_per_us; + + /* only do this if we don't have tpause */ + if (!global_data.intrinsics_support.power_pause) { + const uint64_t start = rte_rdtsc_precise(); + const uint32_t n_pauses = 10000; + double us, us_per_pause; + uint64_t end; + unsigned int i; + + /* estimate number of rte_pause() calls per us*/ + for (i = 0; i < n_pauses; i++) + rte_pause(); + + end = rte_rdtsc_precise(); + us = (end - start) / (double)tsc_per_us; + us_per_pause = us / n_pauses; + + global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause); + } +} + +static uint16_t +clb_umwait(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *addr __rte_unused) +{ + + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + struct rte_power_monitor_cond pmc; + uint16_t ret; + + /* + * we might get a cancellation request while being + * inside the callback, in which case the wakeup + * wouldn't work because it would've arrived too early. + * + * to get around this, we notify the other thread that + * we're sleeping, so that it can spin until we're done. + * unsolicited wakeups are perfectly safe. 
+ */ + q_conf->umwait_in_progress = true; + + /* check if we need to cancel sleep */ + if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { + /* use monitoring condition to sleep */ + ret = rte_eth_get_monitor_addr(port_id, qidx, + &pmc); + if (ret == 0) + rte_power_monitor(&pmc, -1ULL); + } + q_conf->umwait_in_progress = false; + } + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_pause(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *addr __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + /* sleep for 1 microsecond */ + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + /* use tpause if we have it */ + if (global_data.intrinsics_support.power_pause) { + const uint64_t cur = rte_rdtsc(); + const uint64_t wait_tsc = + cur + global_data.tsc_per_us; + rte_power_pause(wait_tsc); + } else { + uint64_t i; + for (i = 0; i < global_data.pause_per_us; i++) + rte_pause(); + } + } + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_scale_freq(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *_ __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) + /* scale down freq */ + rte_power_freq_min(rte_lcore_id()); + } else { + q_conf->empty_poll_stats = 0; + /* scale up freq */ + rte_power_freq_max(rte_lcore_id()); + } + + return nb_rx; +} + +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id, + enum rte_power_pmd_mgmt_type mode) +{ + struct pmd_queue_cfg *queue_cfg; + struct rte_eth_dev_info info; + int ret; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id,
-EINVAL); + + if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) { + ret = -EINVAL; + goto end; + } + + if (rte_eth_dev_info_get(port_id, &info) < 0) { + ret = -EINVAL; + goto end; + } + + /* check if queue id is valid */ + if (queue_id >= info.nb_rx_queues) { + ret = -EINVAL; + goto end; + } + + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) { + ret = -EINVAL; + goto end; + } + + /* we're about to change our state */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; + + /* we need this in various places */ + rte_cpu_get_intrinsics_support(&global_data.intrinsics_support); + + switch (mode) { + case RTE_POWER_MGMT_TYPE_MONITOR: + { + struct rte_power_monitor_cond dummy; + + /* check if rte_power_monitor is supported */ + if (!global_data.intrinsics_support.power_monitor) { + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); + ret = -ENOTSUP; + goto rollback; + } + + /* check if the device supports the necessary PMD API */ + if (rte_eth_get_monitor_addr(port_id, queue_id, + &dummy) == -ENOTSUP) { + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n"); + ret = -ENOTSUP; + goto rollback; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->umwait_in_progress = false; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_umwait, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_SCALE: + { + enum power_management_env env; + /* only PSTATE and ACPI modes are supported */ + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && + !rte_power_check_env_supported( + PM_ENV_PSTATE_CPUFREQ)) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); + ret = -ENOTSUP; + goto rollback; + } + /* ensure we could initialize the power library */ + if (rte_power_init(lcore_id)) { + ret = -EINVAL; + goto rollback; 
+ } + /* ensure we initialized the correct env */ + env = rte_power_get_env(); + if (env != PM_ENV_ACPI_CPUFREQ && + env != PM_ENV_PSTATE_CPUFREQ) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); + ret = -ENOTSUP; + goto rollback; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, + queue_id, clb_scale_freq, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_PAUSE: + /* figure out various time-to-tsc conversions */ + if (global_data.tsc_per_us == 0) + calc_tsc(); + + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_pause, NULL); + break; + } + ret = 0; + + return ret; + +rollback: + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; +end: + return ret; +} + +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id) +{ + struct pmd_queue_cfg *queue_cfg; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); + + if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) + return -EINVAL; + + /* no need to check queue id as wrong queue id would not be enabled */ + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) + return -EINVAL; + + /* let the callback know we're shutting down */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; + + switch (queue_cfg->cb_mode) { + case RTE_POWER_MGMT_TYPE_MONITOR: + { + bool exit = false; + do { + /* + * we may request cancellation while the other thread + * has just entered the callback but hasn't started + * sleeping yet, so keep waking it up until we know it's + * done sleeping. 
+ */ + if (queue_cfg->umwait_in_progress) + rte_power_monitor_wakeup(lcore_id); + else + exit = true; + } while (!exit); + } + /* fall-through */ + case RTE_POWER_MGMT_TYPE_PAUSE: + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + break; + case RTE_POWER_MGMT_TYPE_SCALE: + rte_power_freq_max(lcore_id); + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + rte_power_exit(lcore_id); + break; + } + /* + * we don't free the RX callback here because it is unsafe to do so + * unless we know for a fact that all data plane threads have stopped. + */ + queue_cfg->cur_cb = NULL; + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + + return 0; +} diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h new file mode 100644 index 0000000000..0bfbc6ba69 --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#ifndef _RTE_POWER_PMD_MGMT_H +#define _RTE_POWER_PMD_MGMT_H + +/** + * @file + * RTE PMD Power Management + */ +#include <stdint.h> +#include <stdbool.h> + +#include <rte_common.h> +#include <rte_byteorder.h> +#include <rte_log.h> +#include <rte_power.h> +#include <rte_atomic.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * PMD Power Management Type + */ +enum rte_power_pmd_mgmt_type { + /** Use power-optimized monitoring to wait for incoming traffic */ + RTE_POWER_MGMT_TYPE_MONITOR = 1, + /** Use power-optimized sleep to avoid busy polling */ + RTE_POWER_MGMT_TYPE_PAUSE, + /** Use frequency scaling when traffic is low */ + RTE_POWER_MGMT_TYPE_SCALE, +}; + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Enable power management on a specified RX queue and lcore. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. 
+ * @param queue_id + * The queue identifier of the Ethernet device. + * @param mode + * The power management callback function type. + + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id, + enum rte_power_pmd_mgmt_type mode); + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Disable power management on a specified RX queue and lcore. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The queue identifier of the Ethernet device. + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id); +#ifdef __cplusplus +} +#endif + +#endif diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map index 69ca9af616..61996b4d11 100644 --- a/lib/librte_power/version.map +++ b/lib/librte_power/version.map @@ -34,4 +34,9 @@ EXPERIMENTAL { rte_power_guest_channel_receive_msg; rte_power_poll_stat_fetch; rte_power_poll_stat_update; + + # added in 21.02 + rte_power_pmd_mgmt_queue_enable; + rte_power_pmd_mgmt_queue_disable; + }; -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
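The disable path in the patch above only proceeds when the queue is in the enabled state, passes through a "busy" state while the callback is torn down, and ends in the disabled state. The following is a minimal, self-contained sketch of that guard, not the library code: the names `queue_state`, `queue_cfg_model`, `model_queue_enable` and `model_queue_disable` are hypothetical, and the check on the enable side is an assumption, since the start of the enable function is not visible in this part of the thread.

```c
#include <errno.h>

/* Hypothetical model of the per-queue power management state. */
enum queue_state {
	Q_DISABLED = 0,
	Q_BUSY,
	Q_ENABLED
};

struct queue_cfg_model {
	enum queue_state state;
};

/* Assumption: enabling is refused unless the queue is currently disabled. */
static int
model_queue_enable(struct queue_cfg_model *q)
{
	if (q->state != Q_DISABLED)
		return -EINVAL;
	q->state = Q_ENABLED;
	return 0;
}

/* Mirrors the visible disable path: ENABLED -> BUSY -> DISABLED. */
static int
model_queue_disable(struct queue_cfg_model *q)
{
	if (q->state != Q_ENABLED)
		return -EINVAL;
	/* let the (modelled) callback know we are shutting down */
	q->state = Q_BUSY;
	/* the real code wakes a sleeping core and removes the RX callback here */
	q->state = Q_DISABLED;
	return 0;
}
```

Under this model, calling disable twice in a row, or disable on a queue that was never enabled, returns `-EINVAL` rather than touching the callback pointer.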
* [dpdk-dev] [PATCH v13 08/11] net/ixgbe: implement power management API 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (6 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 07/11] power: add PMD power management API and callback Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 09/11] net/i40e: " Anatoly Burakov ` (3 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Liang Ma <liang.j.ma@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ixgbe/ixgbe_ethdev.c | 1 + drivers/net/ixgbe/ixgbe_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/ixgbe/ixgbe_rxtx.h | 1 + 3 files changed, 27 insertions(+) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 9a47a8b262..4b7a5ca60b 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = { .udp_tunnel_port_del = ixgbe_dev_udp_tunnel_port_del, .tm_ops_get = ixgbe_tm_ops_get, .tx_done_cleanup = ixgbe_dev_tx_done_cleanup, + .get_monitor_addr = ixgbe_get_monitor_addr, }; /* diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index 6cfbb582e2..7e046a1819 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -1369,6 +1369,31 @@ const uint32_t RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP, }; +int +ixgbe_get_monitor_addr(void *rx_queue, struct 
rte_power_monitor_cond *pmc) +{ + volatile union ixgbe_adv_rx_desc *rxdp; + struct ixgbe_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.upper.status_error; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + + /* the registers are 32-bit */ + pmc->data_sz = sizeof(uint32_t); + + return 0; +} + /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */ static inline uint32_t ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask) diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h index 6d2f7c9da3..8a25e98df6 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.h +++ b/drivers/net/ixgbe/ixgbe_rxtx.h @@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev); +int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); #endif /* _IXGBE_RXTX_H_ */ -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
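The driver callback above fills in an address, an expected value, a mask and a data size. One plausible way for a caller to use these is to read the field at its stated width and skip sleeping when the masked value already matches, which for the DD bit means the descriptor has already been written. The sketch below illustrates that check under stated assumptions: `monitor_cond_model`, `read_sized` and `monitor_should_sleep` are hypothetical names, the host is assumed little-endian so no byte-order conversion is shown, and this is not claimed to be the exact logic inside `rte_power_monitor()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the condition a driver would fill in. */
struct monitor_cond_model {
	const volatile void *addr; /* address to watch */
	uint64_t val;              /* masked value meaning "already written" */
	uint64_t mask;             /* bits of interest; 0 means no check */
	uint8_t data_sz;           /* width of the watched field in bytes */
};

/* Read a 1-, 2-, 4- or 8-byte value; unsupported sizes read as 0. */
static uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		return 0;
	}
}

/* True when the watched field does not yet match, so waiting makes sense. */
static bool
monitor_should_sleep(const struct monitor_cond_model *mc)
{
	uint64_t cur;

	if (mc->mask == 0)
		return true; /* no condition supplied, nothing to compare */
	cur = read_sized(mc->addr, mc->data_sz);
	return (cur & mc->mask) != mc->val;
}
```

With a 32-bit status word and `val == mask == 0x1`, the check says "sleep" while the low bit is clear and "do not sleep" once it is set, regardless of the other bits.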
* [dpdk-dev] [PATCH v13 09/11] net/i40e: implement power management API 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (7 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 08/11] net/ixgbe: implement power management API Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 10/11] net/ice: " Anatoly Burakov ` (2 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Acked-by: Jeff Guo <jia.guo@intel.com> --- drivers/net/i40e/i40e_ethdev.c | 1 + drivers/net/i40e/i40e_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/i40e/i40e_rxtx.h | 1 + 3 files changed, 27 insertions(+) diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c index f54769c29d..af2577a140 100644 --- a/drivers/net/i40e/i40e_ethdev.c +++ b/drivers/net/i40e/i40e_ethdev.c @@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = { .mtu_set = i40e_dev_mtu_set, .tm_ops_get = i40e_tm_ops_get, .tx_done_cleanup = i40e_tx_done_cleanup, + .get_monitor_addr = i40e_get_monitor_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index 5df9a9df56..0b4220fc9c 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -72,6 +72,31 @@ #define I40E_TX_OFFLOAD_NOTSUP_MASK \ (PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK) +int 
+i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + struct i40e_rx_queue *rxq = rx_queue; + volatile union i40e_rx_desc *rxdp; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.qword1.status_error_len; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + + /* registers are 64-bit */ + pmc->data_sz = sizeof(uint64_t); + + return 0; +} + static inline void i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp) { diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..e1494525ce 100644 --- a/drivers/net/i40e/i40e_rxtx.h +++ b/drivers/net/i40e/i40e_rxtx.h @@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); +int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); /* For each value it means, datasheet of hardware can tell more details * -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v13 10/11] net/ice: implement power management API 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (8 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 09/11] net/i40e: " Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ice/ice_ethdev.c | 1 + drivers/net/ice/ice_rxtx.c | 26 ++++++++++++++++++++++++++ drivers/net/ice/ice_rxtx.h | 1 + 3 files changed, 28 insertions(+) diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c index 9a5d6a559f..c21682c120 100644 --- a/drivers/net/ice/ice_ethdev.c +++ b/drivers/net/ice/ice_ethdev.c @@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = { .udp_tunnel_port_add = ice_dev_udp_tunnel_port_add, .udp_tunnel_port_del = ice_dev_udp_tunnel_port_del, .tx_done_cleanup = ice_tx_done_cleanup, + .get_monitor_addr = ice_get_monitor_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index 5fbd68eafc..fa9e9a235b 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask; 
uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask; uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask; +int +ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + volatile union ice_rx_flex_desc *rxdp; + struct ice_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.status_error0; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + + /* register is 16-bit */ + pmc->data_sz = sizeof(uint16_t); + + return 0; +} + + static inline uint8_t ice_proto_xtr_type_to_rxdid(uint8_t xtr_type) { diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h index 6b16716063..906fbefdc4 100644 --- a/drivers/net/ice/ice_rxtx.h +++ b/drivers/net/ice/ice_rxtx.h @@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc); int ice_tx_done_cleanup(void *txq, uint32_t free_cnt); +int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \ int i; \ -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v13 11/11] examples/l3fwd-power: enable PMD power mgmt 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (9 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 10/11] net/ice: " Anatoly Burakov @ 2021-01-08 17:42 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-08 17:42 UTC (permalink / raw) To: dev Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev, gage.eads, timothy.mcdaniel, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Add PMD power management feature support to l3fwd-power sample app. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v12: - Allow selecting PMD power management scheme from command-line - Enforce 1 core 1 queue rule .../sample_app_ug/l3_forward_power_man.rst | 35 ++++++++ examples/l3fwd-power/main.c | 89 ++++++++++++++++++- 2 files changed, 122 insertions(+), 2 deletions(-) diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst index 85a78a5c1e..aaa9367fae 100644 --- a/doc/guides/sample_app_ug/l3_forward_power_man.rst +++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst @@ -109,6 +109,8 @@ where, * --telemetry: Telemetry mode. +* --pmd-mgmt: PMD power management mode. + See :doc:`l3_forward` for details. The L3fwd-power example reuses the L3fwd command line options. @@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set to either 0% or The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``. 
+ +PMD power management Mode +------------------------- + +The PMD power management mode of ``l3fwd-power`` is a standalone mode. In this mode, +``l3fwd-power`` performs simple L3 forwarding while enabling the power saving scheme on a specific +port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API. + +.. code-block:: console + + ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)" + +PMD Power Management Mode +------------------------- +There is also a traffic-aware operating mode that, instead of using explicit +power management, will use automatic PMD power management. This mode is limited +to one queue per core, and has three available power management schemes: + +* ``monitor`` - this will use the ``rte_power_monitor()`` function to enter a + power-optimized state (subject to platform support). + +* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid + busy looping when there is no traffic. + +* ``scale`` - this will use frequency scaling routines available in the + ``librte_power`` library. + +See the :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK +Programmer's Guide for more details on PMD power management. + +.. 
code-block:: console + + ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c index 995a3b6ad7..e312b6f355 100644 --- a/examples/l3fwd-power/main.c +++ b/examples/l3fwd-power/main.c @@ -47,6 +47,7 @@ #include <rte_power_empty_poll.h> #include <rte_metrics.h> #include <rte_telemetry.h> +#include <rte_power_pmd_mgmt.h> #include "perf_core.h" #include "main.h" @@ -199,11 +200,14 @@ enum appmode { APP_MODE_LEGACY, APP_MODE_EMPTY_POLL, APP_MODE_TELEMETRY, - APP_MODE_INTERRUPT + APP_MODE_INTERRUPT, + APP_MODE_PMD_MGMT }; enum appmode app_mode; +static enum rte_power_pmd_mgmt_type pmgmt_type; + enum freq_scale_hint_t { FREQ_LOWER = -1, @@ -1611,7 +1615,9 @@ print_usage(const char *prgname) " follow (training_flag, high_threshold, med_threshold)\n" " --telemetry: enable telemetry mode, to update" " empty polls, full polls, and core busyness to telemetry\n" - " --interrupt-only: enable interrupt-only mode\n", + " --interrupt-only: enable interrupt-only mode\n" + " --pmd-mgmt MODE: enable PMD power management mode. 
" + "Currently supported modes: monitor, pause, scale\n", prgname); } @@ -1701,6 +1707,32 @@ parse_config(const char *q_arg) return 0; } + +static int +parse_pmd_mgmt_config(const char *name) +{ +#define PMD_MGMT_MONITOR "monitor" +#define PMD_MGMT_PAUSE "pause" +#define PMD_MGMT_SCALE "scale" + + if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR; + return 0; + } + + if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE; + return 0; + } + + if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE; + return 0; + } + /* unknown PMD power management mode */ + return -1; +} + static int parse_ep_config(const char *q_arg) { @@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg) #define CMD_LINE_OPT_EMPTY_POLL "empty-poll" #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only" #define CMD_LINE_OPT_TELEMETRY "telemetry" +#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt" /* Parse the argument given in the command line of the application */ static int @@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv) {CMD_LINE_OPT_LEGACY, 0, 0, 0}, {CMD_LINE_OPT_TELEMETRY, 0, 0, 0}, {CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0}, + {CMD_LINE_OPT_PMD_MGMT, 1, 0, 0}, {NULL, 0, 0, 0} }; @@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv) printf("telemetry mode is enabled\n"); } + if (!strncmp(lgopts[option_index].name, + CMD_LINE_OPT_PMD_MGMT, + sizeof(CMD_LINE_OPT_PMD_MGMT))) { + if (app_mode != APP_MODE_DEFAULT) { + printf(" power mgmt mode is mutually exclusive with other modes\n"); + return -1; + } + if (parse_pmd_mgmt_config(optarg) < 0) { + printf(" Invalid PMD power management mode: %s\n", + optarg); + return -1; + } + app_mode = APP_MODE_PMD_MGMT; + printf("PMD power mgmt mode is enabled\n"); + } if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_INTERRUPT_ONLY, sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) { @@ -2442,6 +2491,8 @@ 
mode_to_str(enum appmode mode) return "telemetry"; case APP_MODE_INTERRUPT: return "interrupt-only"; + case APP_MODE_PMD_MGMT: + return "pmd mgmt"; default: return "invalid"; } @@ -2671,6 +2722,13 @@ main(int argc, char **argv) qconf = &lcore_conf[lcore_id]; printf("\nInitializing rx queues on lcore %u ... ", lcore_id ); fflush(stdout); + + /* PMD power management mode can only do 1 queue per core */ + if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) { + rte_exit(EXIT_FAILURE, + "In PMD power management mode, only one queue per lcore is allowed\n"); + } + /* init RX queues */ for(queue = 0; queue < qconf->n_rx_queue; ++queue) { struct rte_eth_rxconf rxq_conf; @@ -2708,6 +2766,16 @@ main(int argc, char **argv) rte_exit(EXIT_FAILURE, "Fail to add ptype cb\n"); } + + if (app_mode == APP_MODE_PMD_MGMT) { + ret = rte_power_pmd_mgmt_queue_enable( + lcore_id, portid, queueid, + pmgmt_type); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n", + ret, portid); + } } } @@ -2798,6 +2866,9 @@ main(int argc, char **argv) SKIP_MAIN); } else if (app_mode == APP_MODE_INTERRUPT) { rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN); + } else if (app_mode == APP_MODE_PMD_MGMT) { + /* reuse telemetry loop for PMD power management mode */ + rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN); } if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY) @@ -2824,6 +2895,20 @@ main(int argc, char **argv) if (app_mode == APP_MODE_EMPTY_POLL) rte_power_empty_poll_stat_free(); + if (app_mode == APP_MODE_PMD_MGMT) { + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + qconf = &lcore_conf[lcore_id]; + for (queue = 0; queue < qconf->n_rx_queue; ++queue) { + portid = qconf->rx_queue_list[queue].port_id; + queueid = qconf->rx_queue_list[queue].queue_id; + rte_power_pmd_mgmt_queue_disable(lcore_id, + portid, queueid); + } + } + } + if ((app_mode == 
APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) && deinit_power_library()) rte_exit(EXIT_FAILURE, "deinit_power_library failed\n"); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
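The patch above maps the `--pmd-mgmt` argument to a power management type with three `strncmp()` comparisons. The same mapping can be sketched as a small lookup table, which keeps the accepted names in one place. This is an illustrative alternative and not the sample application's code: `pmd_mode_model`, `mode_names` and `parse_pmd_mode` are hypothetical names, and the enum only mirrors the three modes named in the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the three modes named in the patch. */
enum pmd_mode_model {
	MODE_MONITOR = 1,
	MODE_PAUSE,
	MODE_SCALE
};

static const struct {
	const char *name;
	enum pmd_mode_model mode;
} mode_names[] = {
	{ "monitor", MODE_MONITOR },
	{ "pause", MODE_PAUSE },
	{ "scale", MODE_SCALE },
};

/* Returns 0 and stores the mode on an exact name match, -1 otherwise. */
static int
parse_pmd_mode(const char *name, enum pmd_mode_model *out)
{
	size_t i;

	if (name == NULL || out == NULL)
		return -1;
	for (i = 0; i < sizeof(mode_names) / sizeof(mode_names[0]); i++) {
		if (strcmp(name, mode_names[i].name) == 0) {
			*out = mode_names[i].mode;
			return 0;
		}
	}
	/* unknown power management mode */
	return -1;
}
```

An exact match is required, so an argument such as `scaled` is rejected in the same way as an unknown name.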
* [dpdk-dev] [PATCH v14 00/11] Add PMD power management 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 " Anatoly Burakov ` (10 preceding siblings ...) 2021-01-08 17:42 ` [dpdk-dev] [PATCH v13 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics Anatoly Burakov ` (11 more replies) 11 siblings, 12 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the CPU to enter a power-optimized state while waiting for packets to arrive. There are multiple proposed mechanisms to achieve said power savings: simple frequency scaling, idle loop, and monitoring the Rx queue for incoming packets. The latter is achieved through cooperation with the NIC driver, which lets us know the address of the wake-up event and wait for writes to that address. On IA, this is achieved using the UMONITOR/UMWAIT instructions. They are used in their raw opcode form because there is no widespread compiler support for them yet. Still, the API is made generic enough to hopefully support other architectures, if they happen to implement similar instructions.

To achieve power savings, a very simple mechanism is used: we count empty polls, and if a certain threshold is reached, we employ one of the suggested power management schemes automatically, from within an Rx callback inside the PMD. Once there's traffic again, the empty poll counter is reset.

This patchset also introduces a few changes to the existing power management-related intrinsics, namely to provide a native way of waking up a sleeping core without the application being responsible for it, as well as general robustness improvements.
There's quite a bit of locking going on, but these locks are per-thread and very little (if any) contention is expected, so the performance impact shouldn't be that bad (and in any case the locking happens when we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the application? Our customers specifically requested a way to do it with minimal changes to the application code. The current approach allows users to just flip a switch and automatically have power savings.

Things of note:
- Only 1:1 core to queue mapping is supported, meaning that each lcore may handle RX on at most a single queue
- Three policy types are supported: monitor, pause, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid parameters better
- Fixed numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

doc/guides/prog_guide/power_man.rst | 44 +++ doc/guides/rel_notes/release_21_02.rst | 15 + .../sample_app_ug/l3_forward_power_man.rst | 35 ++ drivers/event/dlb/dlb.c | 10 +- drivers/event/dlb2/dlb2.c | 10 +- drivers/net/i40e/i40e_ethdev.c | 1 + drivers/net/i40e/i40e_rxtx.c | 25 ++ drivers/net/i40e/i40e_rxtx.h | 1 + drivers/net/ice/ice_ethdev.c | 1 + drivers/net/ice/ice_rxtx.c | 26 ++ drivers/net/ice/ice_rxtx.h | 1 + drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_rxtx.c | 25 ++ drivers/net/ixgbe/ixgbe_rxtx.h | 1 + examples/l3fwd-power/main.c | 89 ++++- .../arm/include/rte_power_intrinsics.h | 40 -- lib/librte_eal/arm/meson.build | 1 + lib/librte_eal/arm/rte_power_intrinsics.c | 34 ++ .../include/generic/rte_power_intrinsics.h | 78 ++-- .../ppc/include/rte_power_intrinsics.h | 40 -- lib/librte_eal/ppc/meson.build | 1 + lib/librte_eal/ppc/rte_power_intrinsics.c | 34 ++ lib/librte_eal/version.map | 5 + .../x86/include/rte_power_intrinsics.h | 115 ------ lib/librte_eal/x86/meson.build | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 184 +++++++++ lib/librte_ethdev/rte_ethdev.c | 28 ++ lib/librte_ethdev/rte_ethdev.h | 25 ++ lib/librte_ethdev/rte_ethdev_driver.h | 22 ++ lib/librte_ethdev/version.map | 3 + lib/librte_power/meson.build | 5 +- lib/librte_power/rte_power_pmd_mgmt.c | 360 ++++++++++++++++++ lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++ lib/librte_power/version.map | 5 + 34 files changed, 1097 insertions(+), 259 deletions(-) create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
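The cover letter describes the trigger for power saving as an empty-poll counter: it grows while polls return nothing, a power management scheme kicks in once a threshold is passed, and the counter resets as soon as traffic returns. A minimal sketch of that counter follows. It is a standalone model rather than the Rx callback from the patchset: `poll_state_model`, `on_rx_burst` and the threshold value of 512 are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the value used by the library is not shown here. */
#define EMPTY_POLL_THRESHOLD 512

/* Hypothetical per-queue counter state. */
struct poll_state_model {
	uint64_t empty_polls;
};

/*
 * Feed the result of one Rx burst into the counter.
 * Returns true when the caller should apply its power-saving action
 * (monitor, pause or frequency scaling in the patchset's terms).
 */
static bool
on_rx_burst(struct poll_state_model *st, uint16_t nb_rx)
{
	if (nb_rx > 0) {
		/* traffic is back: start counting from scratch */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

In this model a single non-empty burst is enough to leave the power-saving regime, which matches the cover letter's statement that the counter is reset once there is traffic again.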
* [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 02/11] eal: avoid invalid API usage in " Anatoly Burakov ` (10 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, power intrinsics are inline functions. Make them part of the ABI so that we can have various internal data associated with them without exposing said data to the outside world. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v14: - Fix compile issues on ARM and PPC64 by moving implementations to .c files .../arm/include/rte_power_intrinsics.h | 40 ------ lib/librte_eal/arm/meson.build | 1 + lib/librte_eal/arm/rte_power_intrinsics.c | 42 ++++++ .../include/generic/rte_power_intrinsics.h | 6 +- .../ppc/include/rte_power_intrinsics.h | 40 ------ lib/librte_eal/ppc/meson.build | 1 + lib/librte_eal/ppc/rte_power_intrinsics.c | 42 ++++++ lib/librte_eal/version.map | 5 + .../x86/include/rte_power_intrinsics.h | 115 ----------------- lib/librte_eal/x86/meson.build | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 120 ++++++++++++++++++ 11 files changed, 215 insertions(+), 198 deletions(-) create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index a4a1bc1159..9e498e9ebf 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -13,46 
+13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build index d62875ebae..6ec53ea03a 100644 --- a/lib/librte_eal/arm/meson.build +++ b/lib/librte_eal/arm/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c new file mode 100644 index 0000000000..e5a49facb4 --- /dev/null +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -0,0 +1,42 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2021 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +/** + * This function is not supported on ARM. 
+ */ +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on ARM. + */ +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(lck); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on ARM. + */ +void rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index dd520d90fa..67977bd511 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -52,7 +52,7 @@ * to undefined result. */ __rte_experimental -static inline void rte_power_monitor(const volatile void *p, +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz); @@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p, * wakes up. */ __rte_experimental -static inline void rte_power_monitor_sync(const volatile void *p, +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck); @@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p, * architecture-dependent. 
*/ __rte_experimental -static inline void rte_power_pause(const uint64_t tsc_timestamp); +void rte_power_pause(const uint64_t tsc_timestamp); #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h index 4ed03d521f..c0e9ac279f 100644 --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h @@ -13,46 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on PPC64. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on PPC64. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on PPC64. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/ppc/meson.build b/lib/librte_eal/ppc/meson.build index f4b6d95c42..43c46542fb 100644 --- a/lib/librte_eal/ppc/meson.build +++ b/lib/librte_eal/ppc/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c new file mode 100644 index 0000000000..785effabe6 --- /dev/null +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -0,0 +1,42 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2021 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +/** + * This function is not supported on PPC64. + */ +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on PPC64. + */ +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(lck); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on PPC64. 
+ */ +void rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 354c068f31..31bf76ae81 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -403,6 +403,11 @@ EXPERIMENTAL { rte_service_lcore_may_be_active; rte_vect_get_max_simd_bitwidth; rte_vect_set_max_simd_bitwidth; + + # added in 21.02 + rte_power_monitor; + rte_power_monitor_sync; + rte_power_pause; }; INTERNAL { diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h index c7d790c854..e4c2b87f73 100644 --- a/lib/librte_eal/x86/include/rte_power_intrinsics.h +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h @@ -13,121 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -static inline uint64_t -__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz) -{ - switch (sz) { - case sizeof(uint8_t): - return *(const volatile uint8_t *)p; - case sizeof(uint16_t): - return *(const volatile uint16_t *)p; - case sizeof(uint32_t): - return *(const volatile uint32_t *)p; - case sizeof(uint64_t): - return *(const volatile uint64_t *)p; - default: - /* this is an intrinsic, so we can't have any error handling */ - RTE_ASSERT(0); - return 0; - } -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. 
- */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); -} - -/** - * This function uses TPAUSE instruction and will enter C0.2 state. For more - * information about usage of this instruction, please refer to Intel(R) 64 and - * IA-32 Architectures Software Developer's Manual. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* execute TPAUSE */ - asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build index e78f29002e..dfd42dee0c 100644 --- a/lib/librte_eal/x86/meson.build +++ b/lib/librte_eal/x86/meson.build @@ -8,4 +8,5 @@ sources += files( 'rte_cycles.c', 'rte_hypervisor.c', 'rte_spinlock.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c new file mode 100644 index 0000000000..34c5fd9c3e --- /dev/null +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -0,0 +1,120 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +static inline uint64_t +__get_umwait_val(const volatile void *p, const uint8_t sz) +{ + switch (sz) { + case sizeof(uint8_t): + return *(const volatile uint8_t *)p; + case sizeof(uint16_t): + return *(const volatile uint16_t *)p; + case sizeof(uint32_t): + return *(const volatile uint32_t *)p; + case sizeof(uint64_t): + return *(const volatile uint64_t *)p; + default: + /* this is an intrinsic, so we can't have any error handling */ + RTE_ASSERT(0); + return 0; + } +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
+ */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. + */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. 
+ */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + rte_spinlock_unlock(lck); + + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); + + rte_spinlock_lock(lck); +} + +/** + * This function uses TPAUSE instruction and will enter C0.2 state. For more + * information about usage of this instruction, please refer to Intel(R) 64 and + * IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* execute TPAUSE */ + asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
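For readers following the uninlined x86 code above, here is a minimal, portable sketch of the two pieces of plain C logic that surround the raw opcodes: the masked "is the value already there?" check performed after UMONITOR, and the split of the 64-bit TSC deadline into the two 32-bit halves passed in EDX:EAX. The names `read_monitored`, `wait_would_abort`, `tsc_halves` and `split_tsc` are made up for illustration and are not part of the patch; no UMONITOR/UMWAIT instruction is executed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Read 1, 2, 4 or 8 bytes from p and widen to 64 bits.
 * Mirrors the role of __get_umwait_val() in the patch. */
static uint64_t read_monitored(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		/* the patch uses RTE_ASSERT(0) here; plain assert() stands in */
		assert(0 && "unsupported data size");
		return 0;
	}
}

/* Returns true when the wait should be skipped because the masked
 * value already equals the expected one. A zero mask disables the check. */
static bool wait_would_abort(const volatile void *p, uint64_t expected,
		uint64_t mask, uint8_t sz)
{
	if (mask == 0)
		return false;
	return (read_monitored(p, sz) & mask) == expected;
}

/* The 64-bit TSC deadline is handed to UMWAIT/TPAUSE as two halves. */
struct tsc_halves {
	uint32_t lo; /* goes to EAX */
	uint32_t hi; /* goes to EDX */
};

static struct tsc_halves split_tsc(uint64_t tsc_timestamp)
{
	struct tsc_halves h;

	h.lo = (uint32_t)tsc_timestamp;
	h.hi = (uint32_t)(tsc_timestamp >> 32);
	return h;
}
```

Note that the check runs after the monitor is armed, so a write that lands between arming and the check is still caught by one of the two.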
* [dpdk-dev] [PATCH v14 02/11] eal: avoid invalid API usage in power intrinsics 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 03/11] eal: change API of " Anatoly Burakov ` (9 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the API documentation mandates that if the user wants to use the power management intrinsics, they need to call the `rte_cpu_get_intrinsics_support` API and check support for specific intrinsics. However, if the user does not do that, it is possible to get an illegal instruction error because we're using raw instruction opcodes, which may or may not be supported at runtime. Now that we have everything in a C file, we can check for support at startup and prevent the user from possibly encountering illegal instruction errors. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v14: - Replace uint8_t with bool .../include/generic/rte_power_intrinsics.h | 3 -- lib/librte_eal/x86/rte_power_intrinsics.c | 31 +++++++++++++++++-- 2 files changed, 28 insertions(+), 6 deletions(-) diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 67977bd511..ffa72f7578 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -34,7 +34,6 @@ * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. 
@@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. @@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 34c5fd9c3e..050ae612a8 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -4,6 +4,8 @@ #include "rte_power_intrinsics.h" +static bool wait_supported; + static inline uint64_t __get_umwait_val(const volatile void *p, const uint8_t sz) { @@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. @@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. 
@@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp) const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* execute TPAUSE */ asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} + +RTE_INIT(rte_power_intrinsics_init) { + struct rte_cpu_intrinsics i; + + rte_cpu_get_intrinsics_support(&i); + + if (i.power_monitor && i.power_pause) + wait_supported = 1; } -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
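The pattern in this patch (probe once at startup, cache the answer, return early from each intrinsic) can be sketched outside DPDK roughly as follows. This is an illustration under stated assumptions, not the patch's code: `cpu_intrinsics` and `probe_intrinsics()` are hypothetical stand-ins for `struct rte_cpu_intrinsics` and `rte_cpu_get_intrinsics_support()`, the GCC/Clang `constructor` attribute stands in for `RTE_INIT`, and the stub probe always reports "unsupported" so the sketch stays runnable on any CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cached at startup; read on every call to a guarded intrinsic. */
static bool wait_supported;

/* Hypothetical stand-in for struct rte_cpu_intrinsics. */
struct cpu_intrinsics {
	bool power_monitor;
	bool power_pause;
};

/* Hypothetical stand-in for rte_cpu_get_intrinsics_support().
 * A real probe would query the CPU; this stub reports no support. */
static void probe_intrinsics(struct cpu_intrinsics *out)
{
	out->power_monitor = false;
	out->power_pause = false;
}

/* Runs before main(), in the spirit of RTE_INIT (GCC/Clang extension). */
__attribute__((constructor))
static void power_intrinsics_init(void)
{
	struct cpu_intrinsics i;

	probe_intrinsics(&i);
	wait_supported = i.power_monitor && i.power_pause;
}

/* Returns true if the (omitted) wait instruction would have been issued. */
static bool guarded_pause(uint64_t tsc_timestamp)
{
	(void)tsc_timestamp;

	/* never reach the raw opcode on CPUs that lack it */
	if (!wait_supported)
		return false;

	/* real code would execute TPAUSE here */
	return true;
}
```

As in the patch, a single flag covers both monitor and pause support, so a CPU that reported only one of the two would have both paths disabled.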
* [dpdk-dev] [PATCH v14 03/11] eal: change API of power intrinsics 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 01/11] eal: uninline power intrinsics Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 02/11] eal: avoid invalid API usage in " Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 04/11] eal: remove sync version of power monitor Anatoly Burakov ` (8 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Timothy McDaniel, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Bruce Richardson, Konstantin Ananyev, thomas, david.hunt, chris.macnamara Instead of passing around pointers and integers, collect everything into struct. This makes API design around these intrinsics much easier. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- drivers/event/dlb/dlb.c | 10 ++-- drivers/event/dlb2/dlb2.c | 10 ++-- lib/librte_eal/arm/rte_power_intrinsics.c | 25 ++++------ .../include/generic/rte_power_intrinsics.h | 49 ++++++++----------- lib/librte_eal/ppc/rte_power_intrinsics.c | 25 ++++------ lib/librte_eal/x86/rte_power_intrinsics.c | 32 ++++++------ 6 files changed, 70 insertions(+), 81 deletions(-) diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c index 0c95c4793d..d2f2026291 100644 --- a/drivers/event/dlb/dlb.c +++ b/drivers/event/dlb/dlb.c @@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, /* Interrupts not supported by PF PMD */ return 1; } else if (dlb->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, else expected_value = 0; - rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = 
monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c index 86724863f2..c9a8a02278 100644 --- a/drivers/event/dlb2/dlb2.c +++ b/drivers/event/dlb2/dlb2.c @@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, if (elapsed_ticks >= timeout) { return 1; } else if (dlb2->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb2_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, else expected_value = 0; - rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index e5a49facb4..f2c3506b90 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -7,36 +7,31 @@ /** * This function is not supported on ARM. */ -void rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +void +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); } /** * This function is not supported on ARM. 
*/ -void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +void +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); } /** * This function is not supported on ARM. */ -void rte_power_pause(const uint64_t tsc_timestamp) +void +rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); } diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index ffa72f7578..00c670cb50 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -18,6 +18,18 @@ * which are architecture-dependent. */ +struct rte_power_monitor_cond { + volatile void *addr; /**< Address to monitor for changes */ + uint64_t val; /**< Before attempting the monitoring, the address + * may be read and compared against this value. + **/ + uint64_t mask; /**< 64-bit mask to extract current value from addr */ + uint8_t data_sz; /**< Data size (in bytes) that will be used to compare + * expected value with the memory address. Can be 1, + * 2, 4, or 8. Supplying any other value will lead to + * undefined result. */ +}; + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice @@ -35,25 +47,15 @@ * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. 
- * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. */ __rte_experimental -void rte_power_monitor(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz); +void rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp); /** * @warning @@ -75,30 +77,19 @@ void rte_power_monitor(const volatile void *p, * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. - * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. * @param lck * A spinlock that must be locked before entering the function, will be * unlocked while the CPU is sleeping, and will be locked again once the CPU * wakes up. 
*/ __rte_experimental -void rte_power_monitor_sync(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz, - rte_spinlock_t *lck); +void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck); /** * @warning diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 785effabe6..3897d2024d 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -7,36 +7,31 @@ /** * This function is not supported on PPC64. */ -void rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +void +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); } /** * This function is not supported on PPC64. */ -void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +void +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); } /** * This function is not supported on PPC64. 
*/ -void rte_power_pause(const uint64_t tsc_timestamp) +void +rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); } diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 050ae612a8..fd061e8e50 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -31,9 +31,8 @@ __get_umwait_val(const volatile void *p, const uint8_t sz) * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -50,14 +49,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return; } /* execute UMWAIT */ @@ -73,9 +73,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
*/ void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -92,14 +91,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return; } rte_spinlock_unlock(lck); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
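To make the new calling convention concrete, here is a small standalone sketch of how a caller might fill the condition structure and how the callee evaluates it. `struct monitor_cond` is a local mirror of the `struct rte_power_monitor_cond` fields shown in the patch (`addr`, `val`, `mask`, `data_sz`); `make_cond64()` and `cond_already_met()` are hypothetical helpers, and the sketch handles only the 8-byte case for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the fields of struct rte_power_monitor_cond. */
struct monitor_cond {
	volatile void *addr; /* address to monitor for changes */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* mask applied to the value read from addr */
	uint8_t data_sz;     /* size of the read in bytes: 1, 2, 4 or 8 */
};

/* Hypothetical helper: build a condition for a 64-bit location,
 * the way the dlb/dlb2 hunks fill pmc field by field. */
static struct monitor_cond make_cond64(volatile uint64_t *addr,
		uint64_t val, uint64_t mask)
{
	struct monitor_cond pmc = {
		.addr = addr,
		.val = val,
		.mask = mask,
		.data_sz = sizeof(uint64_t),
	};

	return pmc;
}

/* The check the x86 implementation performs before executing UMWAIT.
 * Sketch limitation: only 8-byte conditions are accepted. */
static bool cond_already_met(const struct monitor_cond *pmc)
{
	uint64_t cur;

	assert(pmc->data_sz == sizeof(uint64_t));

	/* a zero mask means "skip the check and always wait" */
	if (pmc->mask == 0)
		return false;

	cur = *(const volatile uint64_t *)pmc->addr;
	return (cur & pmc->mask) == pmc->val;
}
```

Bundling the four values into one struct means a driver can hand back a ready-made condition, and the intrinsic's signature does not need to change if the condition gains fields later.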
* [dpdk-dev] [PATCH v14 04/11] eal: remove sync version of power monitor 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov ` (2 preceding siblings ...) 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 03/11] eal: change API of " Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 05/11] eal: add monitor wakeup function Anatoly Burakov ` (7 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the "sync" version of power monitor intrinsic is supposed to be used for purposes of waking up a sleeping core. However, there are better ways to achieve the same result, so remove the unneeded function. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- lib/librte_eal/arm/rte_power_intrinsics.c | 12 ----- .../include/generic/rte_power_intrinsics.h | 34 -------------- lib/librte_eal/ppc/rte_power_intrinsics.c | 12 ----- lib/librte_eal/version.map | 1 - lib/librte_eal/x86/rte_power_intrinsics.c | 46 ------------------- 5 files changed, 105 deletions(-) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index f2c3506b90..6b8219b919 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -15,18 +15,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, RTE_SET_USED(tsc_timestamp); } -/** - * This function is not supported on ARM. - */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); -} - /** * This function is not supported on ARM. 
*/ diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 00c670cb50..a6f1955996 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -57,40 +57,6 @@ __rte_experimental void rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint64_t tsc_timestamp); -/** - * @warning - * @b EXPERIMENTAL: this API may change without prior notice - * - * Monitor specific address for changes. This will cause the CPU to enter an - * architecture-defined optimized power state until either the specified - * memory address is written to, a certain TSC timestamp is reached, or other - * reasons cause the CPU to wake up. - * - * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If - * mask is non-zero, the current value pointed to by the `p` pointer will be - * checked against the expected value, and if they match, the entering of - * optimized power state may be aborted. - * - * This call will also lock a spinlock on entering sleep, and release it on - * waking up the CPU. - * - * @warning It is responsibility of the user to check if this function is - * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * - * @param pmc - * The monitoring condition structure. - * @param tsc_timestamp - * Maximum TSC timestamp to wait for. Note that the wait behavior is - * architecture-dependent. - * @param lck - * A spinlock that must be locked before entering the function, will be - * unlocked while the CPU is sleeping, and will be locked again once the CPU - * wakes up. 
- */ -__rte_experimental -void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck); - /** * @warning * @b EXPERIMENTAL: this API may change without prior notice diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 3897d2024d..9a40c4d5d6 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -15,18 +15,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, RTE_SET_USED(tsc_timestamp); } -/** - * This function is not supported on PPC64. - */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); -} - /** * This function is not supported on PPC64. */ diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 31bf76ae81..20945b1efa 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -406,7 +406,6 @@ EXPERIMENTAL { # added in 21.02 rte_power_monitor; - rte_power_monitor_sync; rte_power_pause; }; diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index fd061e8e50..14c88600f0 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -67,52 +67,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, "a"(tsc_l), "d"(tsc_h)); } -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
- */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* prevent user from running this instruction if it's not supported */ - if (!wait_supported) - return; - - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(pmc->addr)); - - if (pmc->mask) { - const uint64_t cur_value = __get_umwait_val( - pmc->addr, pmc->data_sz); - const uint64_t masked = cur_value & pmc->mask; - - /* if the masked value is already matching, abort */ - if (masked == pmc->val) - return; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); -} - /** * This function uses TPAUSE instruction and will enter C0.2 state. For more * information about usage of this instruction, please refer to Intel(R) 64 and -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
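For readers following the x86 code in this series: the `tsc_l`/`tsc_h` pair seen above is the 64-bit TSC deadline split into the two 32-bit halves that the inline assembly passes in EAX ("a") and EDX ("d"). A minimal standalone sketch of that split follows. The `tsc_halves` type and both helper names are made up for illustration and are not part of DPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two halves of a TSC deadline. */
struct tsc_halves {
	uint32_t lo; /* low 32 bits, like tsc_l in the patch (EAX) */
	uint32_t hi; /* high 32 bits, like tsc_h in the patch (EDX) */
};

/* Split a 64-bit deadline the same way the patch does. */
static struct tsc_halves
split_tsc_deadline(uint64_t tsc_timestamp)
{
	struct tsc_halves h;

	h.lo = (uint32_t)tsc_timestamp;
	h.hi = (uint32_t)(tsc_timestamp >> 32);
	return h;
}

/* Inverse operation, only here to show that no bits are lost. */
static uint64_t
join_tsc_deadline(struct tsc_halves h)
{
	return ((uint64_t)h.hi << 32) | h.lo;
}
```

Splitting and re-joining any timestamp yields the original value, so the deadline reaches the instruction unchanged.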
* [dpdk-dev] [PATCH v14 05/11] eal: add monitor wakeup function 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov ` (3 preceding siblings ...) 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 04/11] eal: remove sync version of power monitor Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 06/11] ethdev: add simple power management API Anatoly Burakov ` (6 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Now that we have everything in a C file, we can store the information about our sleep, and have a native mechanism to wake up the sleeping core. This mechanism would however only wake up a core that's sleeping while monitoring - waking up from `rte_power_pause` won't work. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v13: - Add comments around wakeup code to explain what it does - Add lcore_id parameter checking to prevent buffer overrun lib/librte_eal/arm/rte_power_intrinsics.c | 9 ++ .../include/generic/rte_power_intrinsics.h | 16 ++++ lib/librte_eal/ppc/rte_power_intrinsics.c | 9 ++ lib/librte_eal/version.map | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 85 +++++++++++++++++++ 5 files changed, 120 insertions(+) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index 6b8219b919..14081a2c5b 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -23,3 +23,12 @@ rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); } + +/** + * This function is not supported on ARM. 
+ */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index a6f1955996..e311d6f8ea 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -57,6 +57,22 @@ __rte_experimental void rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint64_t tsc_timestamp); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Wake up a specific lcore that is in a power optimized state and is monitoring + * an address. + * + * @note This function will *not* wake up a core that is in a power optimized + * state due to calling `rte_power_pause`. + * + * @param lcore_id + * Lcore ID of a sleeping thread. + */ +__rte_experimental +void rte_power_monitor_wakeup(const unsigned int lcore_id); + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 9a40c4d5d6..a7db61a7c3 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -23,3 +23,12 @@ rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); } + +/** + * This function is not supported on PPC64. 
+ */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 20945b1efa..ac026e289d 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -406,6 +406,7 @@ EXPERIMENTAL { # added in 21.02 rte_power_monitor; + rte_power_monitor_wakeup; rte_power_pause; }; diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 14c88600f0..7cbe156199 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -2,8 +2,31 @@ * Copyright(c) 2020 Intel Corporation */ +#include <rte_common.h> +#include <rte_lcore.h> +#include <rte_spinlock.h> + #include "rte_power_intrinsics.h" +/* + * Per-lcore structure holding current status of C0.2 sleeps. + */ +static struct power_wait_status { + rte_spinlock_t lock; + volatile void *monitor_addr; /**< NULL if not currently sleeping */ +} __rte_cache_aligned wait_status[RTE_MAX_LCORE]; + +static inline void +__umwait_wakeup(volatile void *addr) +{ + uint64_t val; + + /* trigger a write but don't change the value */ + val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED); + __atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0, + __ATOMIC_RELAXED, __ATOMIC_RELAXED); +} + static bool wait_supported; static inline uint64_t @@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + const unsigned int lcore_id = rte_lcore_id(); + struct power_wait_status *s; + + /* prevent non-EAL thread from using this API */ + if (lcore_id >= RTE_MAX_LCORE) + return; /* prevent user from running this instruction if it's not supported */ if (!wait_supported) @@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, if (masked == pmc->val) return; } + + s = 
&wait_status[lcore_id]; + + /* update sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = pmc->addr; + rte_spinlock_unlock(&s->lock); + /* execute UMWAIT */ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); + + /* erase sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = NULL; + rte_spinlock_unlock(&s->lock); } /** @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) { if (i.power_monitor && i.power_pause) wait_supported = 1; } + +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + struct power_wait_status *s; + + /* prevent buffer overrun */ + if (lcore_id >= RTE_MAX_LCORE) + return; + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + + s = &wait_status[lcore_id]; + + /* + * There is a race condition between sleep, wakeup and locking, but we + * don't need to handle it. + * + * Possible situations: + * + * 1. T1 locks, sets address, unlocks + * 2. T2 locks, triggers wakeup, unlocks + * 3. T1 sleeps + * + * In this case, because T1 has already set the address for monitoring, + * we will wake up immediately even if T2 triggers wakeup before T1 + * goes to sleep. + * + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up + * 2. T2 locks, triggers wakeup, and unlocks + * 3. T1 locks, erases address, and unlocks + * + * In this case, since we've already woken up, the "wakeup" was + * unneeded, and since T1 is still waiting on T2 releasing the lock, the + * wakeup address is still valid so it's perfectly safe to write it. + */ + rte_spinlock_lock(&s->lock); + if (s->monitor_addr != NULL) + __umwait_wakeup(s->monitor_addr); + rte_spinlock_unlock(&s->lock); +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
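The wakeup trick in the patch above (store the monitored word back with its current value, under a per-lcore lock) can be tried outside DPDK. The sketch below is only an illustration under stated assumptions: `wait_slot`, `slot_lock`, `slot_unlock`, `touch_without_change` and `wake_slot` are hypothetical names, the lock is a toy spinlock standing in for `rte_spinlock_t`, and the `__atomic_*` builtins are the GCC/Clang ones the patch itself uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the per-lcore wait_status array. */
struct wait_slot {
	bool locked;                     /* toy spinlock flag */
	volatile uint64_t *monitor_addr; /* NULL if not currently sleeping */
};

static void
slot_lock(struct wait_slot *s)
{
	while (__atomic_test_and_set(&s->locked, __ATOMIC_ACQUIRE))
		; /* spin until the flag was previously clear */
}

static void
slot_unlock(struct wait_slot *s)
{
	__atomic_clear(&s->locked, __ATOMIC_RELEASE);
}

/* Compare-exchange the word with its own value: the content never changes. */
static void
touch_without_change(volatile uint64_t *addr)
{
	uint64_t val = __atomic_load_n(addr, __ATOMIC_RELAXED);

	__atomic_compare_exchange_n(addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/* Returns true if a sleeper was registered and its address was touched. */
static bool
wake_slot(struct wait_slot *s)
{
	bool touched = false;

	assert(s != NULL);
	slot_lock(s);
	if (s->monitor_addr != NULL) {
		touch_without_change(s->monitor_addr);
		touched = true;
	}
	slot_unlock(s);
	return touched;
}
```

Because the compare-exchange writes back the value it has just read, other users of the monitored word never observe a different value. Whether that access actually ends a hardware wait is platform behaviour that this sketch does not demonstrate.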
* [dpdk-dev] [PATCH v14 06/11] ethdev: add simple power management API 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov ` (4 preceding siblings ...) 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 05/11] eal: add monitor wakeup function Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 07/11] power: add PMD power management API and callback Anatoly Burakov ` (5 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, Neil Horman, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Add a simple API to allow getting the monitor conditions for power-optimized monitoring of the Rx queues from the PMD, as well as release notes information. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> --- Notes: v13: - Fix typos and issues raised by Andrew doc/guides/rel_notes/release_21_02.rst | 5 +++++ lib/librte_ethdev/rte_ethdev.c | 28 ++++++++++++++++++++++++++ lib/librte_ethdev/rte_ethdev.h | 25 +++++++++++++++++++++++ lib/librte_ethdev/rte_ethdev_driver.h | 22 ++++++++++++++++++++ lib/librte_ethdev/version.map | 3 +++ 5 files changed, 83 insertions(+) diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst index 638f98168b..6de0cb568e 100644 --- a/doc/guides/rel_notes/release_21_02.rst +++ b/doc/guides/rel_notes/release_21_02.rst @@ -55,6 +55,11 @@ New Features Also, make sure to start the actual text at the margin. 
======================================================= +* **ethdev: added new API for PMD power management** + + * ``rte_eth_get_monitor_addr()``, to be used in conjunction with + ``rte_power_monitor()`` to enable automatic power management for PMD's. + Removed Items ------------- diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c index 17ddacc78d..e19dbd838b 100644 --- a/lib/librte_ethdev/rte_ethdev.c +++ b/lib/librte_ethdev/rte_ethdev.c @@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode)); } +int +rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id, + struct rte_power_monitor_cond *pmc) +{ + struct rte_eth_dev *dev; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV); + + dev = &rte_eth_devices[port_id]; + + RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP); + + if (queue_id >= dev->data->nb_rx_queues) { + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id); + return -EINVAL; + } + + if (pmc == NULL) { + RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n", + pmc); + return -EINVAL; + } + + return eth_err(port_id, + dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id], + pmc)); +} + int rte_eth_dev_set_mc_addr_list(uint16_t port_id, struct rte_ether_addr *mc_addr_set, diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h index f5f8919186..ca0f91312e 100644 --- a/lib/librte_ethdev/rte_ethdev.h +++ b/lib/librte_ethdev/rte_ethdev.h @@ -157,6 +157,7 @@ extern "C" { #include <rte_common.h> #include <rte_config.h> #include <rte_ether.h> +#include <rte_power_intrinsics.h> #include "rte_ethdev_trace_fp.h" #include "rte_dev_info.h" @@ -4334,6 +4335,30 @@ __rte_experimental int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, struct rte_eth_burst_mode *mode); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Retrieve the monitor condition 
for a given receive queue. + * + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The Rx queue on the Ethernet device for which information + * will be retrieved. + * @param pmc + * The pointer point to power-optimized monitoring condition structure. + * + * @return + * - 0: Success. + * -ENOTSUP: Operation not supported. + * -EINVAL: Invalid parameters. + * -ENODEV: Invalid port ID. + */ +__rte_experimental +int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id, + struct rte_power_monitor_cond *pmc); + /** * Retrieve device registers and register attributes (number of registers and * register size) diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h index 0eacfd8425..3b3b0ec1a0 100644 --- a/lib/librte_ethdev/rte_ethdev_driver.h +++ b/lib/librte_ethdev/rte_ethdev_driver.h @@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t) (struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction); /**< @internal Unbind peer queue from the current queue. */ +/** + * @internal + * Get address of memory location whose contents will change whenever there is + * new data to be received on an Rx queue. + * + * @param rxq + * Ethdev queue pointer. + * @param pmc + * The pointer to power-optimized monitoring condition structure. + * @return + * Negative errno value on error, 0 on success. + * + * @retval 0 + * Success + * @retval -EINVAL + * Invalid parameters + */ +typedef int (*eth_get_monitor_addr_t)(void *rxq, + struct rte_power_monitor_cond *pmc); + /** * @internal A structure containing the functions exported by an Ethernet driver. */ @@ -917,6 +937,8 @@ struct eth_dev_ops { /**< Set up the connection between the pair of hairpin queues. */ eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind; /**< Disconnect the hairpin queues of a pair from each other. */ + eth_get_monitor_addr_t get_monitor_addr; + /**< Get power monitoring condition for Rx queue. 
*/ }; /** diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map index d3f5410806..a124e1e370 100644 --- a/lib/librte_ethdev/version.map +++ b/lib/librte_ethdev/version.map @@ -240,6 +240,9 @@ EXPERIMENTAL { rte_flow_get_restore_info; rte_flow_tunnel_action_decap_release; rte_flow_tunnel_item_release; + + # added in 21.02 + rte_eth_get_monitor_addr; }; INTERNAL { -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
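To make the contract between a driver's `get_monitor_addr` callback and the sleeping side more concrete, here is a small standalone sketch. Everything in it is hypothetical: `monitor_cond` only mirrors the four fields the patches in this series read from `struct rte_power_monitor_cond` (`addr`, `val`, `mask`, `data_sz`), and `rx_desc` with `DESC_DONE_BIT` is an invented descriptor layout, not the layout of any real NIC.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the monitor condition fields used in this series. */
struct monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* 0 disables the value check */
	uint8_t data_sz;     /* width of the monitored word, in bytes */
};

/* Invented descriptor: one status word carrying a "done" bit. */
#define DESC_DONE_BIT 0x1u
struct rx_desc {
	volatile uint32_t status;
};

/* What a driver-side callback might fill in for the next descriptor. */
static int
fill_monitor_cond(struct rx_desc *next, struct monitor_cond *pmc)
{
	if (next == NULL || pmc == NULL)
		return -EINVAL;

	pmc->addr = &next->status;
	pmc->val = DESC_DONE_BIT;
	pmc->mask = DESC_DONE_BIT;
	pmc->data_sz = sizeof(uint32_t);
	return 0;
}

/* Read data_sz bytes from the monitored address. */
static uint64_t
read_monitored(const struct monitor_cond *pmc)
{
	switch (pmc->data_sz) {
	case 1: return *(volatile uint8_t *)pmc->addr;
	case 2: return *(volatile uint16_t *)pmc->addr;
	case 4: return *(volatile uint32_t *)pmc->addr;
	case 8: return *(volatile uint64_t *)pmc->addr;
	default: return 0; /* unsupported width */
	}
}

/* True when the masked value already matches, so sleeping would be pointless. */
static bool
should_skip_sleep(const struct monitor_cond *pmc)
{
	if (pmc->mask == 0)
		return false;
	return (read_monitored(pmc) & pmc->mask) == pmc->val;
}
```

The check in `should_skip_sleep()` follows the same order as the x86 `rte_power_monitor()` code earlier in the series: read the word, apply the mask, and skip the sleep if the result already equals the expected value.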
* [dpdk-dev] [PATCH v14 07/11] power: add PMD power management API and callback
  2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov
  ` (5 preceding siblings ...)
  2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-11 14:35 ` Anatoly Burakov
  2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 08/11] net/ixgbe: implement power management API Anatoly Burakov
  ` (4 subsequent siblings)
  11 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas, konstantin.ananyev, timothy.mcdaniel, bruce.richardson, chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that enables power saving when no packets are
arriving. It counts empty polls and, once the count reaches a certain
threshold, enters an architecture-defined optimized power state that waits
until either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple queues
per device are supported, but they have to be polled on different cores).

This design uses PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting for the address of the next
   RX descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling

   Reuse the existing DPDK power library to scale core frequency up/down
   depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v13: - Rework the synchronization mechanism to not require locking - Add more parameter checking - Rework n_rx_queues access to not go through internal PMD structures and use public API instead doc/guides/prog_guide/power_man.rst | 44 +++ doc/guides/rel_notes/release_21_02.rst | 10 + lib/librte_power/meson.build | 5 +- lib/librte_power/rte_power_pmd_mgmt.c | 360 +++++++++++++++++++++++++ lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++++ lib/librte_power/version.map | 5 + 6 files changed, 512 insertions(+), 2 deletions(-) create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index 0a3755a901..02280dd689 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -192,6 +192,47 @@ User Cases ---------- The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA. +PMD Power Management API +------------------------ + +Abstract +~~~~~~~~ +Existing power management mechanisms require developers to change application +design or change code to make use of it. The PMD power management API provides a +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering +power saving whenever empty poll count reaches a certain number. + + * Monitor + + This power saving scheme will put the CPU into optimized power state and use + the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX + descriptor address, and wake the CPU up whenever there's new traffic. + + * Pause + + This power saving scheme will avoid busy polling by either entering + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's + not available, use ``rte_pause()``. 
+ + * Frequency scaling + + This power saving scheme will use existing ``librte_power`` library + functionality to scale the core frequency up/down depending on traffic + volume. + + +.. note:: + + Currently, this power management API is limited to mandatory mapping of 1 + queue to 1 core (multiple queues are supported, but they must be polled from + different cores). + +API Overview for PMD Power Management +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +* **Queue Enable**: Enable specific power scheme for certain queue/port/core + +* **Queue Disable**: Disable power scheme for certain queue/port/core + References ---------- @@ -200,3 +241,6 @@ References * The :doc:`../sample_app_ug/vm_power_management` chapter in the :doc:`../sample_app_ug/index` section. + +* The :doc:`../sample_app_ug/rxtx_callbacks` + chapter in the :doc:`../sample_app_ug/index` section. diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst index 6de0cb568e..b34828cad6 100644 --- a/doc/guides/rel_notes/release_21_02.rst +++ b/doc/guides/rel_notes/release_21_02.rst @@ -60,6 +60,16 @@ New Features * ``rte_eth_get_monitor_addr()``, to be used in conjunction with ``rte_power_monitor()`` to enable automatic power management for PMD's. +* **Add PMD power management helper API** + + A new helper API has been added to make using Ethernet PMD power management + easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. 
Three power + management schemes are supported initially: + + * Power saving based on UMWAIT instruction (x86 only) + * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only) + * Power saving based on frequency scaling through the ``librte_power`` library + Removed Items ------------- diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build index 4b4cf1b90b..51a471b669 100644 --- a/lib/librte_power/meson.build +++ b/lib/librte_power/meson.build @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c', 'power_kvm_vm.c', 'guest_channel.c', 'rte_power_empty_poll.c', 'power_pstate_cpufreq.c', + 'rte_power_pmd_mgmt.c', 'power_common.c') -headers = files('rte_power.h','rte_power_empty_poll.h') -deps += ['timer'] +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h') +deps += ['timer' ,'ethdev'] diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c new file mode 100644 index 0000000000..65597d354c --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.c @@ -0,0 +1,360 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <rte_lcore.h> +#include <rte_cycles.h> +#include <rte_cpuflags.h> +#include <rte_malloc.h> +#include <rte_ethdev.h> +#include <rte_power_intrinsics.h> + +#include "rte_power_pmd_mgmt.h" + +#define EMPTYPOLL_MAX 512 + +static struct pmd_conf_data { + struct rte_cpu_intrinsics intrinsics_support; + /**< what do we support? */ + uint64_t tsc_per_us; + /**< pre-calculated tsc diff for 1us */ + uint64_t pause_per_us; + /**< how many rte_pause can we fit in a microsecond? */ +} global_data; + +/** + * Possible power management states of an ethdev port. + */ +enum pmd_mgmt_state { + /** Device power management is disabled. */ + PMD_MGMT_DISABLED = 0, + /** Device power management is enabled. */ + PMD_MGMT_ENABLED, + /** Device power management status is about to change. 
*/ + PMD_MGMT_BUSY +}; + +struct pmd_queue_cfg { + volatile enum pmd_mgmt_state pwr_mgmt_state; + /**< State of power management for this queue */ + enum rte_power_pmd_mgmt_type cb_mode; + /**< Callback mode for this queue */ + const struct rte_eth_rxtx_callback *cur_cb; + /**< Callback instance */ + volatile bool umwait_in_progress; + /**< are we currently sleeping? */ + uint64_t empty_poll_stats; + /**< Number of empty polls */ +} __rte_cache_aligned; + +static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT]; + +static void +calc_tsc(void) +{ + const uint64_t hz = rte_get_timer_hz(); + const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */ + + global_data.tsc_per_us = tsc_per_us; + + /* only do this if we don't have tpause */ + if (!global_data.intrinsics_support.power_pause) { + const uint64_t start = rte_rdtsc_precise(); + const uint32_t n_pauses = 10000; + double us, us_per_pause; + uint64_t end; + unsigned int i; + + /* estimate number of rte_pause() calls per us*/ + for (i = 0; i < n_pauses; i++) + rte_pause(); + + end = rte_rdtsc_precise(); + us = (end - start) / (double)tsc_per_us; + us_per_pause = us / n_pauses; + + global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause); + } +} + +static uint16_t +clb_umwait(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *addr __rte_unused) +{ + + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + struct rte_power_monitor_cond pmc; + uint16_t ret; + + /* + * we might get a cancellation request while being + * inside the callback, in which case the wakeup + * wouldn't work because it would've arrived too early. + * + * to get around this, we notify the other thread that + * we're sleeping, so that it can spin until we're done. + * unsolicited wakeups are perfectly safe. 
+ */ + q_conf->umwait_in_progress = true; + + /* check if we need to cancel sleep */ + if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { + /* use monitoring condition to sleep */ + ret = rte_eth_get_monitor_addr(port_id, qidx, + &pmc); + if (ret == 0) + rte_power_monitor(&pmc, -1ULL); + } + q_conf->umwait_in_progress = false; + } + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_pause(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *addr __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + /* sleep for 1 microsecond */ + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + /* use tpause if we have it */ + if (global_data.intrinsics_support.power_pause) { + const uint64_t cur = rte_rdtsc(); + const uint64_t wait_tsc = + cur + global_data.tsc_per_us; + rte_power_pause(wait_tsc); + } else { + uint64_t i; + for (i = 0; i < global_data.pause_per_us; i++) + rte_pause(); + } + } + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_scale_freq(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *_ __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) + /* scale down freq */ + rte_power_freq_min(rte_lcore_id()); + } else { + q_conf->empty_poll_stats = 0; + /* scale up freq */ + rte_power_freq_max(rte_lcore_id()); + } + + return nb_rx; +} + +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id, + enum rte_power_pmd_mgmt_type mode) +{ + struct pmd_queue_cfg *queue_cfg; + struct rte_eth_dev_info info; + int ret; + + 
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 
-EINVAL); + + if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) { + ret = -EINVAL; + goto end; + } + + if (rte_eth_dev_info_get(port_id, &info) < 0) { + ret = -EINVAL; + goto end; + } + + /* check if queue id is valid */ + if (queue_id >= info.nb_rx_queues) { + ret = -EINVAL; + goto end; + } + + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) { + ret = -EINVAL; + goto end; + } + + /* we're about to change our state */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; + + /* we need this in various places */ + rte_cpu_get_intrinsics_support(&global_data.intrinsics_support); + + switch (mode) { + case RTE_POWER_MGMT_TYPE_MONITOR: + { + struct rte_power_monitor_cond dummy; + + /* check if rte_power_monitor is supported */ + if (!global_data.intrinsics_support.power_monitor) { + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); + ret = -ENOTSUP; + goto rollback; + } + + /* check if the device supports the necessary PMD API */ + if (rte_eth_get_monitor_addr(port_id, queue_id, + &dummy) == -ENOTSUP) { + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n"); + ret = -ENOTSUP; + goto rollback; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->umwait_in_progress = false; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_umwait, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_SCALE: + { + enum power_management_env env; + /* only PSTATE and ACPI modes are supported */ + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && + !rte_power_check_env_supported( + PM_ENV_PSTATE_CPUFREQ)) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); + ret = -ENOTSUP; + goto rollback; + } + /* ensure we could initialize the power library */ + if (rte_power_init(lcore_id)) { + ret = -EINVAL; + goto rollback; 
+ } + /* ensure we initialized the correct env */ + env = rte_power_get_env(); + if (env != PM_ENV_ACPI_CPUFREQ && + env != PM_ENV_PSTATE_CPUFREQ) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); + ret = -ENOTSUP; + goto rollback; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, + queue_id, clb_scale_freq, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_PAUSE: + /* figure out various time-to-tsc conversions */ + if (global_data.tsc_per_us == 0) + calc_tsc(); + + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_pause, NULL); + break; + } + ret = 0; + + return ret; + +rollback: + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; +end: + return ret; +} + +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id) +{ + struct pmd_queue_cfg *queue_cfg; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); + + if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) + return -EINVAL; + + /* no need to check queue id as wrong queue id would not be enabled */ + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) + return -EINVAL; + + /* let the callback know we're shutting down */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; + + switch (queue_cfg->cb_mode) { + case RTE_POWER_MGMT_TYPE_MONITOR: + { + bool exit = false; + do { + /* + * we may request cancellation while the other thread + * has just entered the callback but hasn't started + * sleeping yet, so keep waking it up until we know it's + * done sleeping. 
+ */ + if (queue_cfg->umwait_in_progress) + rte_power_monitor_wakeup(lcore_id); + else + exit = true; + } while (!exit); + } + /* fall-through */ + case RTE_POWER_MGMT_TYPE_PAUSE: + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + break; + case RTE_POWER_MGMT_TYPE_SCALE: + rte_power_freq_max(lcore_id); + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + rte_power_exit(lcore_id); + break; + } + /* + * we don't free the RX callback here because it is unsafe to do so + * unless we know for a fact that all data plane threads have stopped. + */ + queue_cfg->cur_cb = NULL; + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + + return 0; +} diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h new file mode 100644 index 0000000000..0bfbc6ba69 --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#ifndef _RTE_POWER_PMD_MGMT_H +#define _RTE_POWER_PMD_MGMT_H + +/** + * @file + * RTE PMD Power Management + */ +#include <stdint.h> +#include <stdbool.h> + +#include <rte_common.h> +#include <rte_byteorder.h> +#include <rte_log.h> +#include <rte_power.h> +#include <rte_atomic.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * PMD Power Management Type + */ +enum rte_power_pmd_mgmt_type { + /** Use power-optimized monitoring to wait for incoming traffic */ + RTE_POWER_MGMT_TYPE_MONITOR = 1, + /** Use power-optimized sleep to avoid busy polling */ + RTE_POWER_MGMT_TYPE_PAUSE, + /** Use frequency scaling when traffic is low */ + RTE_POWER_MGMT_TYPE_SCALE, +}; + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Enable power management on a specified RX queue and lcore. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. 
+ * @param queue_id + * The queue identifier of the Ethernet device. + * @param mode + * The power management callback function type. + + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id, + enum rte_power_pmd_mgmt_type mode); + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Disable power management on a specified RX queue and lcore. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The queue identifier of the Ethernet device. + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id); +#ifdef __cplusplus +} +#endif + +#endif diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map index 69ca9af616..61996b4d11 100644 --- a/lib/librte_power/version.map +++ b/lib/librte_power/version.map @@ -34,4 +34,9 @@ EXPERIMENTAL { rte_power_guest_channel_receive_msg; rte_power_poll_stat_fetch; rte_power_poll_stat_update; + + # added in 21.02 + rte_power_pmd_mgmt_queue_enable; + rte_power_pmd_mgmt_queue_disable; + }; -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
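All three RX callbacks in the patch above share one skeleton: count consecutive empty polls, act once the count exceeds `EMPTYPOLL_MAX`, and reset the counter as soon as packets arrive. The standalone sketch below isolates that skeleton. `queue_state`, `enter_low_power` and `on_rx_burst` are hypothetical names, and the sleep hook only counts invocations where a real callback would call `rte_power_monitor()`, `rte_power_pause()` or scale the frequency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* same threshold as in the patch */

/* Hypothetical per-queue bookkeeping. */
struct queue_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
	uint64_t sleeps;      /* times the low-power hook ran (illustration only) */
};

/* Placeholder for the scheme-specific power saving action. */
static void
enter_low_power(struct queue_state *q)
{
	q->sleeps++;
}

/* Post-RX hook: nb_rx is the burst result and is passed through unchanged. */
static uint16_t
on_rx_burst(struct queue_state *q, uint16_t nb_rx)
{
	assert(q != NULL);

	if (nb_rx == 0) {
		q->empty_polls++;
		if (q->empty_polls > EMPTYPOLL_MAX)
			enter_low_power(q);
	} else {
		/* traffic is back: start counting from scratch */
		q->empty_polls = 0;
	}
	return nb_rx;
}
```

As in the patch, the counter is not reset by the sleep itself, so every further empty poll past the threshold triggers the power saving action again until a non-empty burst arrives.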
* [dpdk-dev] [PATCH v14 08/11] net/ixgbe: implement power management API 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov ` (6 preceding siblings ...) 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 07/11] power: add PMD power management API and callback Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 09/11] net/i40e: " Anatoly Burakov ` (3 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Liang Ma <liang.j.ma@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ixgbe/ixgbe_ethdev.c | 1 + drivers/net/ixgbe/ixgbe_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/ixgbe/ixgbe_rxtx.h | 1 + 3 files changed, 27 insertions(+) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 9a47a8b262..4b7a5ca60b 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = { .udp_tunnel_port_del = ixgbe_dev_udp_tunnel_port_del, .tm_ops_get = ixgbe_tm_ops_get, .tx_done_cleanup = ixgbe_dev_tx_done_cleanup, + .get_monitor_addr = ixgbe_get_monitor_addr, }; /* diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index 6cfbb582e2..7e046a1819 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -1369,6 +1369,31 @@ const uint32_t RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP, }; +int +ixgbe_get_monitor_addr(void *rx_queue, 
struct rte_power_monitor_cond *pmc) +{ + volatile union ixgbe_adv_rx_desc *rxdp; + struct ixgbe_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.upper.status_error; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + + /* the registers are 32-bit */ + pmc->data_sz = sizeof(uint32_t); + + return 0; +} + /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */ static inline uint32_t ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask) diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h index 6d2f7c9da3..8a25e98df6 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.h +++ b/drivers/net/ixgbe/ixgbe_rxtx.h @@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev); +int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); #endif /* _IXGBE_RXTX_H_ */ -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v14 09/11] net/i40e: implement power management API 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov ` (7 preceding siblings ...) 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 08/11] net/ixgbe: implement power management API Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 10/11] net/ice: " Anatoly Burakov ` (2 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Acked-by: Jeff Guo <jia.guo@intel.com> --- drivers/net/i40e/i40e_ethdev.c | 1 + drivers/net/i40e/i40e_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/i40e/i40e_rxtx.h | 1 + 3 files changed, 27 insertions(+) diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c index f54769c29d..af2577a140 100644 --- a/drivers/net/i40e/i40e_ethdev.c +++ b/drivers/net/i40e/i40e_ethdev.c @@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = { .mtu_set = i40e_dev_mtu_set, .tm_ops_get = i40e_tm_ops_get, .tx_done_cleanup = i40e_tx_done_cleanup, + .get_monitor_addr = i40e_get_monitor_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index 5df9a9df56..0b4220fc9c 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -72,6 +72,31 @@ #define I40E_TX_OFFLOAD_NOTSUP_MASK \ (PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK) 
+int +i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + struct i40e_rx_queue *rxq = rx_queue; + volatile union i40e_rx_desc *rxdp; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.qword1.status_error_len; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + + /* registers are 64-bit */ + pmc->data_sz = sizeof(uint64_t); + + return 0; +} + static inline void i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp) { diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..e1494525ce 100644 --- a/drivers/net/i40e/i40e_rxtx.h +++ b/drivers/net/i40e/i40e_rxtx.h @@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); +int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); /* For each value it means, datasheet of hardware can tell more details * -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v14 10/11] net/ice: implement power management API 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov ` (8 preceding siblings ...) 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 09/11] net/i40e: " Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ice/ice_ethdev.c | 1 + drivers/net/ice/ice_rxtx.c | 26 ++++++++++++++++++++++++++ drivers/net/ice/ice_rxtx.h | 1 + 3 files changed, 28 insertions(+) diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c index 9a5d6a559f..c21682c120 100644 --- a/drivers/net/ice/ice_ethdev.c +++ b/drivers/net/ice/ice_ethdev.c @@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = { .udp_tunnel_port_add = ice_dev_udp_tunnel_port_add, .udp_tunnel_port_del = ice_dev_udp_tunnel_port_del, .tx_done_cleanup = ice_tx_done_cleanup, + .get_monitor_addr = ice_get_monitor_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index 5fbd68eafc..fa9e9a235b 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -26,6 +26,32 @@ uint64_t 
rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask; uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask; uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask; +int +ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + volatile union ice_rx_flex_desc *rxdp; + struct ice_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.status_error0; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + + /* register is 16-bit */ + pmc->data_sz = sizeof(uint16_t); + + return 0; +} + + static inline uint8_t ice_proto_xtr_type_to_rxdid(uint8_t xtr_type) { diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h index 6b16716063..906fbefdc4 100644 --- a/drivers/net/ice/ice_rxtx.h +++ b/drivers/net/ice/ice_rxtx.h @@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc); int ice_tx_done_cleanup(void *txq, uint32_t free_cnt); +int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \ int i; \ -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
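The three `get_monitor_addr` callbacks above (ixgbe, i40e, ice) each fill in an address, an expected value, a mask and a data size, with widths of 32, 64 and 16 bits respectively. As a rough sketch of how such a condition might be evaluated before going to sleep, modelled on the `__rte_power_get_umwait_val()` helper and the masked comparison in the x86 intrinsics of this series: `struct monitor_cond`, `read_monitored()` and `cond_already_met()` are hypothetical names used only for illustration, not the DPDK API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct rte_power_monitor_cond; only the four
 * fields that the driver callbacks fill in are modelled here.
 */
struct monitor_cond {
	const volatile void *addr; /* address to watch for writes */
	uint64_t val;              /* expected value after masking */
	uint64_t mask;             /* bits of interest; 0 disables the check */
	uint8_t data_sz;           /* width of the watched field, in bytes */
};

/* Read a value of the given width from the monitored address. */
static uint64_t
read_monitored(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		assert(0 && "unsupported data size");
		return 0;
	}
}

/*
 * Return true when the masked value already matches the expected value,
 * i.e. the descriptor was already written and sleeping should be skipped.
 */
static bool
cond_already_met(const struct monitor_cond *pmc)
{
	if (pmc->mask == 0)
		return false;
	return (read_monitored(pmc->addr, pmc->data_sz) & pmc->mask) ==
			pmc->val;
}
```

With a mask and value that both select the DD bit, this check returns true as soon as the NIC has written the descriptor back, which is the case where entering the power-optimized state would only add latency.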
* [dpdk-dev] [PATCH v14 11/11] examples/l3fwd-power: enable PMD power mgmt 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov ` (9 preceding siblings ...) 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 10/11] net/ice: " Anatoly Burakov @ 2021-01-11 14:35 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:35 UTC (permalink / raw) To: dev Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev, timothy.mcdaniel, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Add PMD power management feature support to l3fwd-power sample app. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v12: - Allow selecting PMD power management scheme from command-line - Enforce 1 core 1 queue rule .../sample_app_ug/l3_forward_power_man.rst | 35 ++++++++ examples/l3fwd-power/main.c | 89 ++++++++++++++++++- 2 files changed, 122 insertions(+), 2 deletions(-) diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst index 85a78a5c1e..aaa9367fae 100644 --- a/doc/guides/sample_app_ug/l3_forward_power_man.rst +++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst @@ -109,6 +109,8 @@ where, * --telemetry: Telemetry mode. +* --pmd-mgmt: PMD power management mode. + See :doc:`l3_forward` for details. The L3fwd-power example reuses the L3fwd command line options. @@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set to either 0% or The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``. 
+ +PMD power management Mode +------------------------- + +The PMD power management mode support for ``l3fwd-power`` is a standalone mode, in this mode +``l3fwd-power`` does simple l3fwding along with enable the power saving scheme on specific +port/queue/lcore. Main purpose for this mode is to demonstrate how to use the PMD power management API. + +.. code-block:: console + + ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)" + +PMD Power Management Mode +------------------------- +There is also a traffic-aware operating mode that, instead of using explicit +power management, will use automatic PMD power management. This mode is limited +to one queue per core, and has three available power management schemes: + +* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a + power-optimized state (subject to platform support). + +* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid + busy looping when there is no traffic. + +* ``scale`` - this will use frequency scaling routines available in the + ``librte_power`` library. + +See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK +Programmer's Guide for more details on PMD power management. + +.. 
code-block:: console + + ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c index 995a3b6ad7..e312b6f355 100644 --- a/examples/l3fwd-power/main.c +++ b/examples/l3fwd-power/main.c @@ -47,6 +47,7 @@ #include <rte_power_empty_poll.h> #include <rte_metrics.h> #include <rte_telemetry.h> +#include <rte_power_pmd_mgmt.h> #include "perf_core.h" #include "main.h" @@ -199,11 +200,14 @@ enum appmode { APP_MODE_LEGACY, APP_MODE_EMPTY_POLL, APP_MODE_TELEMETRY, - APP_MODE_INTERRUPT + APP_MODE_INTERRUPT, + APP_MODE_PMD_MGMT }; enum appmode app_mode; +static enum rte_power_pmd_mgmt_type pmgmt_type; + enum freq_scale_hint_t { FREQ_LOWER = -1, @@ -1611,7 +1615,9 @@ print_usage(const char *prgname) " follow (training_flag, high_threshold, med_threshold)\n" " --telemetry: enable telemetry mode, to update" " empty polls, full polls, and core busyness to telemetry\n" - " --interrupt-only: enable interrupt-only mode\n", + " --interrupt-only: enable interrupt-only mode\n" + " --pmd-mgmt MODE: enable PMD power management mode. 
" + "Currently supported modes: monitor, pause, scale\n", prgname); } @@ -1701,6 +1707,32 @@ parse_config(const char *q_arg) return 0; } + +static int +parse_pmd_mgmt_config(const char *name) +{ +#define PMD_MGMT_MONITOR "monitor" +#define PMD_MGMT_PAUSE "pause" +#define PMD_MGMT_SCALE "scale" + + if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR; + return 0; + } + + if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE; + return 0; + } + + if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE; + return 0; + } + /* unknown PMD power management mode */ + return -1; +} + static int parse_ep_config(const char *q_arg) { @@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg) #define CMD_LINE_OPT_EMPTY_POLL "empty-poll" #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only" #define CMD_LINE_OPT_TELEMETRY "telemetry" +#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt" /* Parse the argument given in the command line of the application */ static int @@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv) {CMD_LINE_OPT_LEGACY, 0, 0, 0}, {CMD_LINE_OPT_TELEMETRY, 0, 0, 0}, {CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0}, + {CMD_LINE_OPT_PMD_MGMT, 1, 0, 0}, {NULL, 0, 0, 0} }; @@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv) printf("telemetry mode is enabled\n"); } + if (!strncmp(lgopts[option_index].name, + CMD_LINE_OPT_PMD_MGMT, + sizeof(CMD_LINE_OPT_PMD_MGMT))) { + if (app_mode != APP_MODE_DEFAULT) { + printf(" power mgmt mode is mutually exclusive with other modes\n"); + return -1; + } + if (parse_pmd_mgmt_config(optarg) < 0) { + printf(" Invalid PMD power management mode: %s\n", + optarg); + return -1; + } + app_mode = APP_MODE_PMD_MGMT; + printf("PMD power mgmt mode is enabled\n"); + } if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_INTERRUPT_ONLY, sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) { @@ -2442,6 +2491,8 @@ 
mode_to_str(enum appmode mode) return "telemetry"; case APP_MODE_INTERRUPT: return "interrupt-only"; + case APP_MODE_PMD_MGMT: + return "pmd mgmt"; default: return "invalid"; } @@ -2671,6 +2722,13 @@ main(int argc, char **argv) qconf = &lcore_conf[lcore_id]; printf("\nInitializing rx queues on lcore %u ... ", lcore_id ); fflush(stdout); + + /* PMD power management mode can only do 1 queue per core */ + if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) { + rte_exit(EXIT_FAILURE, + "In PMD power management mode, only one queue per lcore is allowed\n"); + } + /* init RX queues */ for(queue = 0; queue < qconf->n_rx_queue; ++queue) { struct rte_eth_rxconf rxq_conf; @@ -2708,6 +2766,16 @@ main(int argc, char **argv) rte_exit(EXIT_FAILURE, "Fail to add ptype cb\n"); } + + if (app_mode == APP_MODE_PMD_MGMT) { + ret = rte_power_pmd_mgmt_queue_enable( + lcore_id, portid, queueid, + pmgmt_type); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n", + ret, portid); + } } } @@ -2798,6 +2866,9 @@ main(int argc, char **argv) SKIP_MAIN); } else if (app_mode == APP_MODE_INTERRUPT) { rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN); + } else if (app_mode == APP_MODE_PMD_MGMT) { + /* reuse telemetry loop for PMD power management mode */ + rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN); } if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY) @@ -2824,6 +2895,20 @@ main(int argc, char **argv) if (app_mode == APP_MODE_EMPTY_POLL) rte_power_empty_poll_stat_free(); + if (app_mode == APP_MODE_PMD_MGMT) { + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + qconf = &lcore_conf[lcore_id]; + for (queue = 0; queue < qconf->n_rx_queue; ++queue) { + portid = qconf->rx_queue_list[queue].port_id; + queueid = qconf->rx_queue_list[queue].queue_id; + rte_power_pmd_mgmt_queue_disable(lcore_id, + portid, queueid); + } + } + } + if ((app_mode == 
APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) && deinit_power_library()) rte_exit(EXIT_FAILURE, "deinit_power_library failed\n"); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
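The `parse_pmd_mgmt_config()` helper in the patch above maps the `--pmd-mgmt` argument to a scheme with three `strncmp()` calls. The same mapping could also be written as a table lookup; the following is only an illustrative sketch, and `enum pmd_mgmt_type` and `parse_pmd_mgmt_mode()` are hypothetical names that mirror, but are not, the identifiers from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of enum rte_power_pmd_mgmt_type from the patch. */
enum pmd_mgmt_type {
	MGMT_TYPE_MONITOR = 1,
	MGMT_TYPE_PAUSE,
	MGMT_TYPE_SCALE,
};

/*
 * Map a --pmd-mgmt argument to a scheme.
 * Returns 0 and stores the scheme in *out on success, -1 on unknown input.
 */
static int
parse_pmd_mgmt_mode(const char *name, enum pmd_mgmt_type *out)
{
	static const struct {
		const char *name;
		enum pmd_mgmt_type type;
	} modes[] = {
		{ "monitor", MGMT_TYPE_MONITOR },
		{ "pause", MGMT_TYPE_PAUSE },
		{ "scale", MGMT_TYPE_SCALE },
	};
	size_t i;

	assert(out != NULL);
	if (name == NULL)
		return -1;

	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
		/* exact match only: "scaled" or "sca" must be rejected */
		if (strcmp(modes[i].name, name) == 0) {
			*out = modes[i].type;
			return 0;
		}
	}
	return -1;
}
```

A table keeps the accepted names in one place, so adding a fourth scheme would only need one new row.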
* [dpdk-dev] [PATCH v15 00/11] Add PMD power management 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 00/11] Add PMD power management Anatoly Burakov ` (10 preceding siblings ...) 2021-01-11 14:35 ` [dpdk-dev] [PATCH v14 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov ` (11 more replies) 11 siblings, 12 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara This patchset proposes a simple API for Ethernet drivers to cause the CPU to enter a power-optimized state while waiting for packets to arrive. There are multiple proposed mechanisms to achieve said power savings: simple frequency scaling, idle loop, and monitoring the Rx queue for incoming packets. The latter is achieved through cooperation with the NIC driver, which lets us know the address of the wake-up event and wait for writes to that address. On IA, this is achieved through using UMONITOR/UMWAIT instructions. They are used in their raw opcode form because there is no widespread compiler support for them yet. Still, the API is made generic enough to hopefully support other architectures, if they happen to implement similar instructions. To achieve power savings, there is a very simple mechanism used: we're counting empty polls, and if a certain threshold is reached, we employ one of the suggested power management schemes automatically, from within an Rx callback inside the PMD. Once there's traffic again, the empty poll counter is reset. This patchset also introduces a few changes into existing power management-related intrinsics, namely to provide a native way of waking up a sleeping core without the application being responsible for it, as well as general robustness improvements. 
There's quite a bit of locking going on, but these locks are per-thread and very little (if any) contention is expected, so the performance impact shouldn't be that bad (and in any case the locking happens when we're about to sleep anyway). Why are we putting it into ethdev as opposed to leaving this up to the application? Our customers specifically requested a way to do it with minimal changes to the application code. The current approach lets the user just flip a switch and automatically have power savings. Things of note: - Only 1:1 core to queue mapping is supported, meaning that each lcore must handle RX on at most a single queue - Three policy types are supported: Monitor, Pause and Frequency Scaling - Power management is enabled per-queue - The API doesn't extend to other device types v15: - Fixed incorrect check in UMWAIT callback - Fixed accidental whitespace changes v14: - Fixed ARM/PPC builds - Addressed various review comments v13: - Reworked the librte_power code to require less locking and handle invalid parameters better - Fixed numerous rebase errors present in v12 v12: - Rebased on top of 21.02 - Reworked the power intrinsics code Anatoly Burakov (5): eal: uninline power intrinsics eal: avoid invalid API usage in power intrinsics eal: change API of power intrinsics eal: remove sync version of power monitor eal: add monitor wakeup function Liang Ma (6): ethdev: add simple power management API power: add PMD power management API and callback net/ixgbe: implement power management API net/i40e: implement power management API net/ice: implement power management API examples/l3fwd-power: enable PMD power mgmt doc/guides/prog_guide/power_man.rst | 44 +++ doc/guides/rel_notes/release_21_02.rst | 15 + .../sample_app_ug/l3_forward_power_man.rst | 35 ++ drivers/event/dlb/dlb.c | 10 +- drivers/event/dlb2/dlb2.c | 10 +- drivers/net/i40e/i40e_ethdev.c | 1 + drivers/net/i40e/i40e_rxtx.c | 25 ++ drivers/net/i40e/i40e_rxtx.h | 1 + drivers/net/ice/ice_ethdev.c | 1 + 
drivers/net/ice/ice_rxtx.c | 26 ++ drivers/net/ice/ice_rxtx.h | 1 + drivers/net/ixgbe/ixgbe_ethdev.c | 1 + drivers/net/ixgbe/ixgbe_rxtx.c | 25 ++ drivers/net/ixgbe/ixgbe_rxtx.h | 1 + examples/l3fwd-power/main.c | 89 ++++- .../arm/include/rte_power_intrinsics.h | 40 -- lib/librte_eal/arm/meson.build | 1 + lib/librte_eal/arm/rte_power_intrinsics.c | 34 ++ .../include/generic/rte_power_intrinsics.h | 78 ++-- .../ppc/include/rte_power_intrinsics.h | 40 -- lib/librte_eal/ppc/meson.build | 1 + lib/librte_eal/ppc/rte_power_intrinsics.c | 34 ++ lib/librte_eal/version.map | 5 + .../x86/include/rte_power_intrinsics.h | 115 ------ lib/librte_eal/x86/meson.build | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 184 +++++++++ lib/librte_ethdev/rte_ethdev.c | 28 ++ lib/librte_ethdev/rte_ethdev.h | 25 ++ lib/librte_ethdev/rte_ethdev_driver.h | 22 ++ lib/librte_ethdev/version.map | 3 + lib/librte_power/meson.build | 5 +- lib/librte_power/rte_power_pmd_mgmt.c | 359 ++++++++++++++++++ lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++ lib/librte_power/version.map | 5 + 34 files changed, 1096 insertions(+), 259 deletions(-) create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-12 16:09 ` Ananyev, Konstantin 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 02/11] eal: avoid invalid API usage in " Anatoly Burakov ` (10 subsequent siblings) 11 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, power intrinsics are inline functions. Make them part of the ABI so that we can have various internal data associated with them without exposing said data to the outside world. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v14: - Fix compile issues on ARM and PPC64 by moving implementations to .c files .../arm/include/rte_power_intrinsics.h | 40 ------ lib/librte_eal/arm/meson.build | 1 + lib/librte_eal/arm/rte_power_intrinsics.c | 42 ++++++ .../include/generic/rte_power_intrinsics.h | 6 +- .../ppc/include/rte_power_intrinsics.h | 40 ------ lib/librte_eal/ppc/meson.build | 1 + lib/librte_eal/ppc/rte_power_intrinsics.c | 42 ++++++ lib/librte_eal/version.map | 5 + .../x86/include/rte_power_intrinsics.h | 115 ----------------- lib/librte_eal/x86/meson.build | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 120 ++++++++++++++++++ 11 files changed, 215 insertions(+), 198 deletions(-) create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index a4a1bc1159..9e498e9ebf 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ 
b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -13,46 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build index d62875ebae..6ec53ea03a 100644 --- a/lib/librte_eal/arm/meson.build +++ b/lib/librte_eal/arm/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c new file mode 100644 index 0000000000..e5a49facb4 --- /dev/null +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -0,0 +1,42 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2021 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +/** + * This function is not supported on ARM. 
+ */ +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on ARM. + */ +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(lck); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on ARM. + */ +void rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index dd520d90fa..67977bd511 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -52,7 +52,7 @@ * to undefined result. */ __rte_experimental -static inline void rte_power_monitor(const volatile void *p, +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz); @@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p, * wakes up. */ __rte_experimental -static inline void rte_power_monitor_sync(const volatile void *p, +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck); @@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p, * architecture-dependent. 
*/ __rte_experimental -static inline void rte_power_pause(const uint64_t tsc_timestamp); +void rte_power_pause(const uint64_t tsc_timestamp); #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h index 4ed03d521f..c0e9ac279f 100644 --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h @@ -13,46 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on PPC64. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on PPC64. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on PPC64. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/ppc/meson.build b/lib/librte_eal/ppc/meson.build index f4b6d95c42..43c46542fb 100644 --- a/lib/librte_eal/ppc/meson.build +++ b/lib/librte_eal/ppc/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c new file mode 100644 index 0000000000..785effabe6 --- /dev/null +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -0,0 +1,42 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2021 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +/** + * This function is not supported on PPC64. + */ +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on PPC64. + */ +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(lck); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on PPC64. 
+ */ +void rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 354c068f31..31bf76ae81 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -403,6 +403,11 @@ EXPERIMENTAL { rte_service_lcore_may_be_active; rte_vect_get_max_simd_bitwidth; rte_vect_set_max_simd_bitwidth; + + # added in 21.02 + rte_power_monitor; + rte_power_monitor_sync; + rte_power_pause; }; INTERNAL { diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h index c7d790c854..e4c2b87f73 100644 --- a/lib/librte_eal/x86/include/rte_power_intrinsics.h +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h @@ -13,121 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -static inline uint64_t -__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz) -{ - switch (sz) { - case sizeof(uint8_t): - return *(const volatile uint8_t *)p; - case sizeof(uint16_t): - return *(const volatile uint16_t *)p; - case sizeof(uint32_t): - return *(const volatile uint32_t *)p; - case sizeof(uint64_t): - return *(const volatile uint64_t *)p; - default: - /* this is an intrinsic, so we can't have any error handling */ - RTE_ASSERT(0); - return 0; - } -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. 
- */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); -} - -/** - * This function uses TPAUSE instruction and will enter C0.2 state. For more - * information about usage of this instruction, please refer to Intel(R) 64 and - * IA-32 Architectures Software Developer's Manual. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* execute TPAUSE */ - asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build index e78f29002e..dfd42dee0c 100644 --- a/lib/librte_eal/x86/meson.build +++ b/lib/librte_eal/x86/meson.build @@ -8,4 +8,5 @@ sources += files( 'rte_cycles.c', 'rte_hypervisor.c', 'rte_spinlock.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c new file mode 100644 index 0000000000..34c5fd9c3e --- /dev/null +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -0,0 +1,120 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +static inline uint64_t +__get_umwait_val(const volatile void *p, const uint8_t sz) +{ + switch (sz) { + case sizeof(uint8_t): + return *(const volatile uint8_t *)p; + case sizeof(uint16_t): + return *(const volatile uint16_t *)p; + case sizeof(uint32_t): + return *(const volatile uint32_t *)p; + case sizeof(uint64_t): + return *(const volatile uint64_t *)p; + default: + /* this is an intrinsic, so we can't have any error handling */ + RTE_ASSERT(0); + return 0; + } +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
+ */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. + */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. 
+ */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + rte_spinlock_unlock(lck); + + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); + + rte_spinlock_lock(lck); +} + +/** + * This function uses TPAUSE instruction and will enter C0.2 state. For more + * information about usage of this instruction, please refer to Intel(R) 64 and + * IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* execute TPAUSE */ + asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
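For anyone reading along: the UMWAIT and TPAUSE paths in the patch above hand the 64-bit TSC deadline to the instruction as two 32-bit halves (low half in EAX, high half in EDX). A minimal sketch of just that arithmetic, in portable C with no inline assembly, might look as follows. The `tsc_halves` struct and both helper names are hypothetical and exist only for this illustration; they are not part of the patch.

```c
#include <assert.h> /* only needed for the checks in the usage note */
#include <stdint.h>

/* Hypothetical container for the two register operands. */
struct tsc_halves {
	uint32_t lo; /* low 32 bits, passed via the "a" constraint in the patch */
	uint32_t hi; /* high 32 bits, passed via the "d" constraint in the patch */
};

/* Split a 64-bit TSC deadline the same way the patch computes tsc_l/tsc_h. */
static inline struct tsc_halves
split_tsc_deadline(uint64_t tsc_timestamp)
{
	struct tsc_halves h;

	h.lo = (uint32_t)tsc_timestamp;
	h.hi = (uint32_t)(tsc_timestamp >> 32);
	return h;
}

/* Recombine the halves; useful to convince oneself nothing is lost. */
static inline uint64_t
join_tsc_deadline(struct tsc_halves h)
{
	return ((uint64_t)h.hi << 32) | h.lo;
}
```

Splitting and rejoining is lossless, so `join_tsc_deadline(split_tsc_deadline(x)) == x` holds for any 64-bit `x`.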
* Re: [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov @ 2021-01-12 16:09 ` Ananyev, Konstantin 2021-01-12 16:14 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-12 16:09 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, McDaniel, Timothy, Hunt, David, Macnamara, Chris > diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c > new file mode 100644 > index 0000000000..34c5fd9c3e > --- /dev/null > +++ b/lib/librte_eal/x86/rte_power_intrinsics.c > @@ -0,0 +1,120 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2020 Intel Corporation > + */ > + > +#include "rte_power_intrinsics.h" > + > +static inline uint64_t > +__get_umwait_val(const volatile void *p, const uint8_t sz) > +{ > + switch (sz) { > + case sizeof(uint8_t): > + return *(const volatile uint8_t *)p; > + case sizeof(uint16_t): > + return *(const volatile uint16_t *)p; > + case sizeof(uint32_t): > + return *(const volatile uint32_t *)p; > + case sizeof(uint64_t): > + return *(const volatile uint64_t *)p; > + default: > + /* this is an intrinsic, so we can't have any error handling */ > + RTE_ASSERT(0); > + return 0; Nearly forgot - as now this function is not inline anymore, we can probably get rid of assert and return some error code instead? > + } > +} > + ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics 2021-01-12 16:09 ` Ananyev, Konstantin @ 2021-01-12 16:14 ` Burakov, Anatoly 0 siblings, 0 replies; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-12 16:14 UTC (permalink / raw) To: Ananyev, Konstantin, dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, McDaniel, Timothy, Hunt, David, Macnamara, Chris On 12-Jan-21 4:09 PM, Ananyev, Konstantin wrote: > >> diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c >> new file mode 100644 >> index 0000000000..34c5fd9c3e >> --- /dev/null >> +++ b/lib/librte_eal/x86/rte_power_intrinsics.c >> @@ -0,0 +1,120 @@ >> +/* SPDX-License-Identifier: BSD-3-Clause >> + * Copyright(c) 2020 Intel Corporation >> + */ >> + >> +#include "rte_power_intrinsics.h" >> + >> +static inline uint64_t >> +__get_umwait_val(const volatile void *p, const uint8_t sz) >> +{ >> +switch (sz) { >> +case sizeof(uint8_t): >> +return *(const volatile uint8_t *)p; >> +case sizeof(uint16_t): >> +return *(const volatile uint16_t *)p; >> +case sizeof(uint32_t): >> +return *(const volatile uint32_t *)p; >> +case sizeof(uint64_t): >> +return *(const volatile uint64_t *)p; >> +default: >> +/* this is an intrinsic, so we can't have any error handling */ >> +RTE_ASSERT(0); >> +return 0; > > Nearly forgot - as now this function is not inline anymore, we can probably > get rid of assert and return some error code instead? > Well, this would necessitate changing the API to include return values, which I think is OK, because at this point it's a fully fledged API (rather than an intrinsic) anyway. >> +} >> +} >> + -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
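To make the suggestion in this sub-thread concrete, here is a rough sketch of what an error-returning variant of the value read could look like. This is only an illustration of the idea being discussed, not what was merged: the name `get_umwait_val_checked`, the out-parameter, and the choice of `-EINVAL` are all assumptions made for the example.

```c
#include <assert.h> /* only needed for the checks in the usage note */
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical variant of __get_umwait_val(): the value read from 'p' is
 * stored through 'out', and the return code reports an unsupported size
 * instead of asserting.
 */
static inline int
get_umwait_val_checked(const volatile void *p, const uint8_t sz, uint64_t *out)
{
	switch (sz) {
	case sizeof(uint8_t):
		*out = *(const volatile uint8_t *)p;
		return 0;
	case sizeof(uint16_t):
		*out = *(const volatile uint16_t *)p;
		return 0;
	case sizeof(uint32_t):
		*out = *(const volatile uint32_t *)p;
		return 0;
	case sizeof(uint64_t):
		*out = *(const volatile uint64_t *)p;
		return 0;
	default:
		/* unsupported size: report it and leave *out untouched */
		return -EINVAL;
	}
}
```

A caller such as `rte_power_monitor()` would then have to propagate the error, which is the API change (a non-void return type) mentioned in the reply above.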
* [dpdk-dev] [PATCH v15 02/11] eal: avoid invalid API usage in power intrinsics 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 03/11] eal: change API of " Anatoly Burakov ` (9 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the API documentation mandates that if the user wants to use the power management intrinsics, they need to call the `rte_cpu_get_intrinsics_support` API and check support for specific intrinsics. However, if the user does not do that, it is possible to get an illegal instruction error because we're using raw instruction opcodes, which may or may not be supported at runtime. Now that we have everything in a C file, we can check for support at startup and prevent the user from possibly encountering illegal instruction errors. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v15: - Remove accidental whitespace changes v14: - Replace uint8_t with bool .../include/generic/rte_power_intrinsics.h | 3 --- lib/librte_eal/x86/rte_power_intrinsics.c | 25 +++++++++++++++++++ 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 67977bd511..ffa72f7578 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -34,7 +34,6 @@ * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. 
- * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. @@ -75,7 +74,6 @@ void rte_power_monitor(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. @@ -111,7 +109,6 @@ void rte_power_monitor_sync(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 34c5fd9c3e..a164ad55fc 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -4,6 +4,8 @@ #include "rte_power_intrinsics.h" +static bool wait_supported; + static inline uint64_t __get_umwait_val(const volatile void *p, const uint8_t sz) { @@ -35,6 +37,11 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. 
@@ -72,6 +79,11 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. @@ -112,9 +124,22 @@ rte_power_pause(const uint64_t tsc_timestamp) const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + /* execute TPAUSE */ asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); } + +RTE_INIT(rte_power_intrinsics_init) { + struct rte_cpu_intrinsics i; + + rte_cpu_get_intrinsics_support(&i); + + if (i.power_monitor && i.power_pause) + wait_supported = 1; +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
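The guard pattern in this patch (probe once at startup, then have every entry point return early when the instructions are unavailable) can be tried out without DPDK or WAITPKG hardware. The sketch below is a stand-in under stated assumptions: `struct cpu_intrinsics` only mimics the two fields the patch reads from `struct rte_cpu_intrinsics`, `probe_intrinsics()` replaces the real `rte_cpu_get_intrinsics_support()` call with a caller-supplied flag, and a counter stands in for executing TPAUSE. In the patch itself the init body runs as an `RTE_INIT` constructor rather than being called explicitly.

```c
#include <assert.h> /* only needed for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for struct rte_cpu_intrinsics; field names follow the patch. */
struct cpu_intrinsics {
	bool power_monitor;
	bool power_pause;
};

static bool wait_supported;      /* latched once at init, as in the patch */
static unsigned int pause_calls; /* counts how often TPAUSE would have run */

/* Hypothetical probe; real code would query the CPU flags instead. */
static inline void
probe_intrinsics(struct cpu_intrinsics *i, bool has_waitpkg)
{
	i->power_monitor = has_waitpkg;
	i->power_pause = has_waitpkg;
}

/* Mirrors the body of the RTE_INIT function added by the patch. */
static inline void
power_intrinsics_init(bool has_waitpkg)
{
	struct cpu_intrinsics i;

	probe_intrinsics(&i, has_waitpkg);
	wait_supported = i.power_monitor && i.power_pause;
}

/* Guarded entry point: silently does nothing on unsupported CPUs. */
static inline void
power_pause_guarded(uint64_t tsc_timestamp)
{
	(void)tsc_timestamp;

	/* prevent running the instruction if it's not supported */
	if (!wait_supported)
		return;

	pause_calls++; /* the raw TPAUSE opcode would be emitted here */
}
```

One consequence worth noting: because the functions return void, a caller on an unsupported CPU gets an immediate return with no indication that no waiting took place, which ties in with the return-code discussion on patch 01.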
* [dpdk-dev] [PATCH v15 03/11] eal: change API of power intrinsics 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 01/11] eal: uninline power intrinsics Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 02/11] eal: avoid invalid API usage in " Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 04/11] eal: remove sync version of power monitor Anatoly Burakov ` (8 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Timothy McDaniel, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Bruce Richardson, Konstantin Ananyev, thomas, david.hunt, chris.macnamara Instead of passing around pointers and integers, collect everything into a struct. This makes API design around these intrinsics much easier. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- drivers/event/dlb/dlb.c | 10 ++-- drivers/event/dlb2/dlb2.c | 10 ++-- lib/librte_eal/arm/rte_power_intrinsics.c | 25 ++++------ .../include/generic/rte_power_intrinsics.h | 49 ++++++++----------- lib/librte_eal/ppc/rte_power_intrinsics.c | 25 ++++------ lib/librte_eal/x86/rte_power_intrinsics.c | 32 ++++++------ 6 files changed, 70 insertions(+), 81 deletions(-) diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c index 0c95c4793d..d2f2026291 100644 --- a/drivers/event/dlb/dlb.c +++ b/drivers/event/dlb/dlb.c @@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, /* Interrupts not supported by PF PMD */ return 1; } else if (dlb->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, else expected_value = 0; - rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = 
monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c index 86724863f2..c9a8a02278 100644 --- a/drivers/event/dlb2/dlb2.c +++ b/drivers/event/dlb2/dlb2.c @@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, if (elapsed_ticks >= timeout) { return 1; } else if (dlb2->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb2_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, else expected_value = 0; - rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index e5a49facb4..f2c3506b90 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -7,36 +7,31 @@ /** * This function is not supported on ARM. */ -void rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +void +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); } /** * This function is not supported on ARM. 
*/ -void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +void +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); } /** * This function is not supported on ARM. */ -void rte_power_pause(const uint64_t tsc_timestamp) +void +rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); } diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index ffa72f7578..00c670cb50 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -18,6 +18,18 @@ * which are architecture-dependent. */ +struct rte_power_monitor_cond { + volatile void *addr; /**< Address to monitor for changes */ + uint64_t val; /**< Before attempting the monitoring, the address + * may be read and compared against this value. + **/ + uint64_t mask; /**< 64-bit mask to extract current value from addr */ + uint8_t data_sz; /**< Data size (in bytes) that will be used to compare + * expected value with the memory address. Can be 1, + * 2, 4, or 8. Supplying any other value will lead to + * undefined result. */ +}; + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice @@ -35,25 +47,15 @@ * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. 
- * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. */ __rte_experimental -void rte_power_monitor(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz); +void rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp); /** * @warning @@ -75,30 +77,19 @@ void rte_power_monitor(const volatile void *p, * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. - * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. * @param lck * A spinlock that must be locked before entering the function, will be * unlocked while the CPU is sleeping, and will be locked again once the CPU * wakes up. 
*/ __rte_experimental -void rte_power_monitor_sync(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz, - rte_spinlock_t *lck); +void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck); /** * @warning diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 785effabe6..3897d2024d 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -7,36 +7,31 @@ /** * This function is not supported on PPC64. */ -void rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +void +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); } /** * This function is not supported on PPC64. */ -void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +void +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); } /** * This function is not supported on PPC64. 
*/ -void rte_power_pause(const uint64_t tsc_timestamp) +void +rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); } diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index a164ad55fc..9b0638148d 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -31,9 +31,8 @@ __get_umwait_val(const volatile void *p, const uint8_t sz) * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -50,14 +49,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return; } /* execute UMWAIT */ @@ -73,9 +73,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
*/ void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -92,14 +91,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return; } rte_spinlock_unlock(lck); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
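For reviewers who want to poke at the new condition struct in isolation, the sketch below reproduces the "is the masked value already matching?" check from the x86 implementation in portable C. It is an illustration under assumptions rather than DPDK code: `struct monitor_cond` is a local copy of the layout introduced by this patch (renamed so it cannot clash with the real header), and `read_sized()` and `monitor_would_abort()` are hypothetical helper names. No UMONITOR/UMWAIT is issued here.

```c
#include <assert.h> /* only needed for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* Local copy of the struct layout from the patch, renamed for the example. */
struct monitor_cond {
	volatile void *addr; /* address to monitor for changes */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* mask applied to the value read from addr */
	uint8_t data_sz;     /* 1, 2, 4 or 8 bytes */
};

/* Read 'sz' bytes from 'p'; unsupported sizes read as 0 in this sketch. */
static inline uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		return 0;
	}
}

/*
 * True when the monitor call in the patch would return early instead of
 * executing UMWAIT: the mask is non-zero and the masked value already
 * equals the expected value.
 */
static inline bool
monitor_would_abort(const struct monitor_cond *pmc)
{
	if (pmc->mask == 0)
		return false; /* a zero mask skips the comparison entirely */

	return (read_sized(pmc->addr, pmc->data_sz) & pmc->mask) == pmc->val;
}
```

Filling the struct follows the same pattern as the dlb/dlb2 hunks above: set `addr`, `val`, `mask` and `data_sz`, then pass a pointer to it together with the TSC deadline.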
* [dpdk-dev] [PATCH v15 04/11] eal: remove sync version of power monitor 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (2 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 03/11] eal: change API of " Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 05/11] eal: add monitor wakeup function Anatoly Burakov ` (7 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the "sync" version of power monitor intrinsic is supposed to be used for purposes of waking up a sleeping core. However, there are better ways to achieve the same result, so remove the unneeded function. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- lib/librte_eal/arm/rte_power_intrinsics.c | 12 ----- .../include/generic/rte_power_intrinsics.h | 34 -------------- lib/librte_eal/ppc/rte_power_intrinsics.c | 12 ----- lib/librte_eal/version.map | 1 - lib/librte_eal/x86/rte_power_intrinsics.c | 46 ------------------- 5 files changed, 105 deletions(-) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index f2c3506b90..6b8219b919 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -15,18 +15,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, RTE_SET_USED(tsc_timestamp); } -/** - * This function is not supported on ARM. - */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); -} - /** * This function is not supported on ARM. 
*/ diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 00c670cb50..a6f1955996 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -57,40 +57,6 @@ __rte_experimental void rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint64_t tsc_timestamp); -/** - * @warning - * @b EXPERIMENTAL: this API may change without prior notice - * - * Monitor specific address for changes. This will cause the CPU to enter an - * architecture-defined optimized power state until either the specified - * memory address is written to, a certain TSC timestamp is reached, or other - * reasons cause the CPU to wake up. - * - * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If - * mask is non-zero, the current value pointed to by the `p` pointer will be - * checked against the expected value, and if they match, the entering of - * optimized power state may be aborted. - * - * This call will also lock a spinlock on entering sleep, and release it on - * waking up the CPU. - * - * @warning It is responsibility of the user to check if this function is - * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * - * @param pmc - * The monitoring condition structure. - * @param tsc_timestamp - * Maximum TSC timestamp to wait for. Note that the wait behavior is - * architecture-dependent. - * @param lck - * A spinlock that must be locked before entering the function, will be - * unlocked while the CPU is sleeping, and will be locked again once the CPU - * wakes up. 
- */ -__rte_experimental -void rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck); - /** * @warning * @b EXPERIMENTAL: this API may change without prior notice diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 3897d2024d..9a40c4d5d6 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -15,18 +15,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, RTE_SET_USED(tsc_timestamp); } -/** - * This function is not supported on PPC64. - */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); -} - /** * This function is not supported on PPC64. */ diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 31bf76ae81..20945b1efa 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -406,7 +406,6 @@ EXPERIMENTAL { # added in 21.02 rte_power_monitor; - rte_power_monitor_sync; rte_power_pause; }; diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 9b0638148d..487a783a2c 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -67,52 +67,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, "a"(tsc_l), "d"(tsc_h)); } -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
- */ -void -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* prevent user from running this instruction if it's not supported */ - if (!wait_supported) - return; - - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(pmc->addr)); - - if (pmc->mask) { - const uint64_t cur_value = __get_umwait_val( - pmc->addr, pmc->data_sz); - const uint64_t masked = cur_value & pmc->mask; - - /* if the masked value is already matching, abort */ - if (masked == pmc->val) - return; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); -} - /** * This function uses TPAUSE instruction and will enter C0.2 state. For more * information about usage of this instruction, please refer to Intel(R) 64 and -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v15 05/11] eal: add monitor wakeup function 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (3 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 04/11] eal: remove sync version of power monitor Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 06/11] ethdev: add simple power management API Anatoly Burakov ` (6 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Now that we have everything in a C file, we can store the information about our sleep, and have a native mechanism to wake up the sleeping core. This mechanism would however only wake up a core that's sleeping while monitoring - waking up from `rte_power_pause` won't work. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v13: - Add comments around wakeup code to explain what it does - Add lcore_id parameter checking to prevent buffer overrun lib/librte_eal/arm/rte_power_intrinsics.c | 9 ++ .../include/generic/rte_power_intrinsics.h | 16 ++++ lib/librte_eal/ppc/rte_power_intrinsics.c | 9 ++ lib/librte_eal/version.map | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 85 +++++++++++++++++++ 5 files changed, 120 insertions(+) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index 6b8219b919..14081a2c5b 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -23,3 +23,12 @@ rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); } + +/** + * This function is not supported on ARM. 
+ */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index a6f1955996..e311d6f8ea 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -57,6 +57,22 @@ __rte_experimental void rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint64_t tsc_timestamp); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Wake up a specific lcore that is in a power optimized state and is monitoring + * an address. + * + * @note This function will *not* wake up a core that is in a power optimized + * state due to calling `rte_power_pause`. + * + * @param lcore_id + * Lcore ID of a sleeping thread. + */ +__rte_experimental +void rte_power_monitor_wakeup(const unsigned int lcore_id); + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 9a40c4d5d6..a7db61a7c3 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -23,3 +23,12 @@ rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); } + +/** + * This function is not supported on PPC64. 
+ */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 20945b1efa..ac026e289d 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -406,6 +406,7 @@ EXPERIMENTAL { # added in 21.02 rte_power_monitor; + rte_power_monitor_wakeup; rte_power_pause; }; diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 487a783a2c..941da138ce 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -2,8 +2,31 @@ * Copyright(c) 2020 Intel Corporation */ +#include <rte_common.h> +#include <rte_lcore.h> +#include <rte_spinlock.h> + #include "rte_power_intrinsics.h" +/* + * Per-lcore structure holding current status of C0.2 sleeps. + */ +static struct power_wait_status { + rte_spinlock_t lock; + volatile void *monitor_addr; /**< NULL if not currently sleeping */ +} __rte_cache_aligned wait_status[RTE_MAX_LCORE]; + +static inline void +__umwait_wakeup(volatile void *addr) +{ + uint64_t val; + + /* trigger a write but don't change the value */ + val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED); + __atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0, + __ATOMIC_RELAXED, __ATOMIC_RELAXED); +} + static bool wait_supported; static inline uint64_t @@ -36,6 +59,12 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + const unsigned int lcore_id = rte_lcore_id(); + struct power_wait_status *s; + + /* prevent non-EAL thread from using this API */ + if (lcore_id >= RTE_MAX_LCORE) + return; /* prevent user from running this instruction if it's not supported */ if (!wait_supported) @@ -60,11 +89,24 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, if (masked == pmc->val) return; } + + s = 
&wait_status[lcore_id]; + + /* update sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = pmc->addr; + rte_spinlock_unlock(&s->lock); + /* execute UMWAIT */ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); + + /* erase sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = NULL; + rte_spinlock_unlock(&s->lock); } /** @@ -97,3 +139,46 @@ RTE_INIT(rte_power_intrinsics_init) { if (i.power_monitor && i.power_pause) wait_supported = 1; } + +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + struct power_wait_status *s; + + /* prevent buffer overrun */ + if (lcore_id >= RTE_MAX_LCORE) + return; + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return; + + s = &wait_status[lcore_id]; + + /* + * There is a race condition between sleep, wakeup and locking, but we + * don't need to handle it. + * + * Possible situations: + * + * 1. T1 locks, sets address, unlocks + * 2. T2 locks, triggers wakeup, unlocks + * 3. T1 sleeps + * + * In this case, because T1 has already set the address for monitoring, + * we will wake up immediately even if T2 triggers wakeup before T1 + * goes to sleep. + * + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up + * 2. T2 locks, triggers wakeup, and unlocks + * 3. T1 locks, erases address, and unlocks + * + * In this case, since we've already woken up, the "wakeup" was + * unneeded, and since T1 is still waiting on T2 releasing the lock, the + * wakeup address is still valid so it's perfectly safe to write it. + */ + rte_spinlock_lock(&s->lock); + if (s->monitor_addr != NULL) + __umwait_wakeup(s->monitor_addr); + rte_spinlock_unlock(&s->lock); +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
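The wakeup path in the patch above depends on a write that leaves the monitored value unchanged. A minimal standalone sketch of that idea follows. It uses C11 atomics instead of the GCC builtins in the patch, `touch_without_change` is a hypothetical name that does not exist in DPDK, and nothing here executes UMONITOR/UMWAIT. Whether such a compare-exchange counts as a store to the armed address is assumed from the patch's own comment, not demonstrated by this code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a DPDK API): write to *addr without changing
 * what is stored there. The current value is loaded and then swapped in
 * for itself, mirroring the load + compare-exchange pair in the patch.
 *
 * Returns true if the compare-exchange succeeded, false if another
 * thread changed the value in between (in which case that other write
 * already touched the location).
 */
static inline bool
touch_without_change(_Atomic uint64_t *addr)
{
	uint64_t val;

	assert(addr != NULL);

	val = atomic_load_explicit(addr, memory_order_relaxed);

	/* replace val with val: the contents stay the same */
	return atomic_compare_exchange_strong_explicit(addr, &val, val,
			memory_order_relaxed, memory_order_relaxed);
}
```

The design choice worth noting is that the waker never needs to know what the monitored word means to the driver: it does not set or clear any bits, so a spurious wakeup cannot corrupt a descriptor.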
* [dpdk-dev] [PATCH v15 06/11] ethdev: add simple power management API 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (4 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 05/11] eal: add monitor wakeup function Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 07/11] power: add PMD power management API and callback Anatoly Burakov ` (5 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, Neil Horman, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Add a simple API to allow getting the monitor conditions for power-optimized monitoring of the Rx queues from the PMD, as well as release notes information. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> --- Notes: v13: - Fix typos and issues raised by Andrew doc/guides/rel_notes/release_21_02.rst | 5 +++++ lib/librte_ethdev/rte_ethdev.c | 28 ++++++++++++++++++++++++++ lib/librte_ethdev/rte_ethdev.h | 25 +++++++++++++++++++++++ lib/librte_ethdev/rte_ethdev_driver.h | 22 ++++++++++++++++++++ lib/librte_ethdev/version.map | 3 +++ 5 files changed, 83 insertions(+) diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst index 638f98168b..6de0cb568e 100644 --- a/doc/guides/rel_notes/release_21_02.rst +++ b/doc/guides/rel_notes/release_21_02.rst @@ -55,6 +55,11 @@ New Features Also, make sure to start the actual text at the margin. 
======================================================= +* **ethdev: added new API for PMD power management** + + * ``rte_eth_get_monitor_addr()``, to be used in conjunction with + ``rte_power_monitor()`` to enable automatic power management for PMD's. + Removed Items ------------- diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c index 17ddacc78d..e19dbd838b 100644 --- a/lib/librte_ethdev/rte_ethdev.c +++ b/lib/librte_ethdev/rte_ethdev.c @@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode)); } +int +rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id, + struct rte_power_monitor_cond *pmc) +{ + struct rte_eth_dev *dev; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV); + + dev = &rte_eth_devices[port_id]; + + RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP); + + if (queue_id >= dev->data->nb_rx_queues) { + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id); + return -EINVAL; + } + + if (pmc == NULL) { + RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n", + pmc); + return -EINVAL; + } + + return eth_err(port_id, + dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id], + pmc)); +} + int rte_eth_dev_set_mc_addr_list(uint16_t port_id, struct rte_ether_addr *mc_addr_set, diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h index f5f8919186..ca0f91312e 100644 --- a/lib/librte_ethdev/rte_ethdev.h +++ b/lib/librte_ethdev/rte_ethdev.h @@ -157,6 +157,7 @@ extern "C" { #include <rte_common.h> #include <rte_config.h> #include <rte_ether.h> +#include <rte_power_intrinsics.h> #include "rte_ethdev_trace_fp.h" #include "rte_dev_info.h" @@ -4334,6 +4335,30 @@ __rte_experimental int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, struct rte_eth_burst_mode *mode); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Retrieve the monitor condition 
for a given receive queue. + * + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The Rx queue on the Ethernet device for which information + * will be retrieved. + * @param pmc + * The pointer point to power-optimized monitoring condition structure. + * + * @return + * - 0: Success. + * -ENOTSUP: Operation not supported. + * -EINVAL: Invalid parameters. + * -ENODEV: Invalid port ID. + */ +__rte_experimental +int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id, + struct rte_power_monitor_cond *pmc); + /** * Retrieve device registers and register attributes (number of registers and * register size) diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h index 0eacfd8425..3b3b0ec1a0 100644 --- a/lib/librte_ethdev/rte_ethdev_driver.h +++ b/lib/librte_ethdev/rte_ethdev_driver.h @@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t) (struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction); /**< @internal Unbind peer queue from the current queue. */ +/** + * @internal + * Get address of memory location whose contents will change whenever there is + * new data to be received on an Rx queue. + * + * @param rxq + * Ethdev queue pointer. + * @param pmc + * The pointer to power-optimized monitoring condition structure. + * @return + * Negative errno value on error, 0 on success. + * + * @retval 0 + * Success + * @retval -EINVAL + * Invalid parameters + */ +typedef int (*eth_get_monitor_addr_t)(void *rxq, + struct rte_power_monitor_cond *pmc); + /** * @internal A structure containing the functions exported by an Ethernet driver. */ @@ -917,6 +937,8 @@ struct eth_dev_ops { /**< Set up the connection between the pair of hairpin queues. */ eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind; /**< Disconnect the hairpin queues of a pair from each other. */ + eth_get_monitor_addr_t get_monitor_addr; + /**< Get power monitoring condition for Rx queue. 
*/ }; /** diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map index d3f5410806..a124e1e370 100644 --- a/lib/librte_ethdev/version.map +++ b/lib/librte_ethdev/version.map @@ -240,6 +240,9 @@ EXPERIMENTAL { rte_flow_get_restore_info; rte_flow_tunnel_action_decap_release; rte_flow_tunnel_item_release; + + # added in 21.02 + rte_eth_get_monitor_addr; }; INTERNAL { -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
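To make the contract of `get_monitor_addr` more concrete, here is a standalone sketch of what a driver-side callback and the pre-sleep check could look like. It does not use DPDK headers: `struct monitor_cond` is a local stand-in that only mirrors the fields the patches read (`addr`, `val`, `mask`, `data_sz`), and the descriptor layout, the `DEMO_DESC_DONE` bit and all `demo_*` names are invented for illustration. Real drivers define their own descriptor formats.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the monitor condition; not the DPDK definition. */
struct monitor_cond {
	volatile void *addr; /* address to monitor for writes */
	uint64_t val;        /* value that means "do not sleep" */
	uint64_t mask;       /* bits of *addr to compare; 0 disables the check */
	uint8_t data_sz;     /* width of the monitored word in bytes */
};

/* Hypothetical RX descriptor with a "descriptor done" status bit. */
struct demo_rx_desc {
	uint64_t buf_addr;
	volatile uint32_t status;
	uint32_t length;
};
#define DEMO_DESC_DONE 0x1u

/* Hypothetical callback: describe the next descriptor to be written. */
static int
demo_get_monitor_addr(struct demo_rx_desc *ring, uint16_t next,
		struct monitor_cond *pmc)
{
	if (ring == NULL || pmc == NULL)
		return -EINVAL;

	pmc->addr = &ring[next].status;
	pmc->val = DEMO_DESC_DONE;
	pmc->mask = DEMO_DESC_DONE;
	pmc->data_sz = sizeof(ring[next].status);
	return 0;
}

/* Read a word of the given width from the monitored address. */
static uint64_t
demo_read(const volatile void *addr, uint8_t sz)
{
	switch (sz) {
	case 1: return *(const volatile uint8_t *)addr;
	case 2: return *(const volatile uint16_t *)addr;
	case 4: return *(const volatile uint32_t *)addr;
	case 8: return *(const volatile uint64_t *)addr;
	default: return 0;
	}
}

/* True if the masked value already matches, so sleeping should be skipped. */
static bool
demo_already_matches(const struct monitor_cond *pmc)
{
	if (pmc->mask == 0)
		return false;
	return (demo_read(pmc->addr, pmc->data_sz) & pmc->mask) == pmc->val;
}
```

In this sketch the check runs after the address has been armed and before the wait, which is the ordering the x86 implementation in this series appears to follow: a packet that lands in between is then caught by the comparison instead of being slept through.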
* [dpdk-dev] [PATCH v15 07/11] power: add PMD power management API and callback 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (5 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 06/11] ethdev: add simple power management API Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 08/11] net/ixgbe: implement power management API Anatoly Burakov ` (4 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas, konstantin.ananyev, timothy.mcdaniel, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no packets are arriving. It is based on counting the number of empty polls and, when the number reaches a certain threshold, entering an architecture-defined optimized power state that will wait until either a TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple queues per device are supported, but they have to be polled on different cores).

This design uses PMD RX callbacks.

1. UMWAIT/UMONITOR: When a certain threshold of empty polls is reached, the core will go into a power-optimized sleep while waiting for the address of the next RX descriptor to be written to.

2. TPAUSE/Pause instruction: This method uses the pause (or TPAUSE, if available) instruction to avoid busy polling.

3. Frequency scaling: Reuse the existing DPDK power library to scale the core frequency up or down depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v15: - Fix check in UMWAIT callback v13: - Rework the synchronization mechanism to not require locking - Add more parameter checking - Rework n_rx_queues access to not go through internal PMD structures and use public API instead v13: - Rework the synchronization mechanism to not require locking - Add more parameter checking - Rework n_rx_queues access to not go through internal PMD structures and use public API instead doc/guides/prog_guide/power_man.rst | 44 +++ doc/guides/rel_notes/release_21_02.rst | 10 + lib/librte_power/meson.build | 5 +- lib/librte_power/rte_power_pmd_mgmt.c | 359 +++++++++++++++++++++++++ lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++++ lib/librte_power/version.map | 5 + 6 files changed, 511 insertions(+), 2 deletions(-) create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index 0a3755a901..02280dd689 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -192,6 +192,47 @@ User Cases ---------- The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA. +PMD Power Management API +------------------------ + +Abstract +~~~~~~~~ +Existing power management mechanisms require developers to change application +design or change code to make use of it. The PMD power management API provides a +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering +power saving whenever empty poll count reaches a certain number. + + * Monitor + + This power saving scheme will put the CPU into optimized power state and use + the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX + descriptor address, and wake the CPU up whenever there's new traffic. 
+ + * Pause + + This power saving scheme will avoid busy polling by either entering + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's + not available, use ``rte_pause()``. + + * Frequency scaling + + This power saving scheme will use existing ``librte_power`` library + functionality to scale the core frequency up/down depending on traffic + volume. + + +.. note:: + + Currently, this power management API is limited to mandatory mapping of 1 + queue to 1 core (multiple queues are supported, but they must be polled from + different cores). + +API Overview for PMD Power Management +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +* **Queue Enable**: Enable specific power scheme for certain queue/port/core + +* **Queue Disable**: Disable power scheme for certain queue/port/core + References ---------- @@ -200,3 +241,6 @@ References * The :doc:`../sample_app_ug/vm_power_management` chapter in the :doc:`../sample_app_ug/index` section. + +* The :doc:`../sample_app_ug/rxtx_callbacks` + chapter in the :doc:`../sample_app_ug/index` section. diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst index 6de0cb568e..b34828cad6 100644 --- a/doc/guides/rel_notes/release_21_02.rst +++ b/doc/guides/rel_notes/release_21_02.rst @@ -60,6 +60,16 @@ New Features * ``rte_eth_get_monitor_addr()``, to be used in conjunction with ``rte_power_monitor()`` to enable automatic power management for PMD's. +* **Add PMD power management helper API** + + A new helper API has been added to make using Ethernet PMD power management + easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. 
Three power + management schemes are supported initially: + + * Power saving based on UMWAIT instruction (x86 only) + * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only) + * Power saving based on frequency scaling through the ``librte_power`` library + Removed Items ------------- diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build index 4b4cf1b90b..51a471b669 100644 --- a/lib/librte_power/meson.build +++ b/lib/librte_power/meson.build @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c', 'power_kvm_vm.c', 'guest_channel.c', 'rte_power_empty_poll.c', 'power_pstate_cpufreq.c', + 'rte_power_pmd_mgmt.c', 'power_common.c') -headers = files('rte_power.h','rte_power_empty_poll.h') -deps += ['timer'] +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h') +deps += ['timer' ,'ethdev'] diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c new file mode 100644 index 0000000000..470c3a912b --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.c @@ -0,0 +1,359 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <rte_lcore.h> +#include <rte_cycles.h> +#include <rte_cpuflags.h> +#include <rte_malloc.h> +#include <rte_ethdev.h> +#include <rte_power_intrinsics.h> + +#include "rte_power_pmd_mgmt.h" + +#define EMPTYPOLL_MAX 512 + +static struct pmd_conf_data { + struct rte_cpu_intrinsics intrinsics_support; + /**< what do we support? */ + uint64_t tsc_per_us; + /**< pre-calculated tsc diff for 1us */ + uint64_t pause_per_us; + /**< how many rte_pause can we fit in a microisecond? */ +} global_data; + +/** + * Possible power management states of an ethdev port. + */ +enum pmd_mgmt_state { + /** Device power management is disabled. */ + PMD_MGMT_DISABLED = 0, + /** Device power management is enabled. */ + PMD_MGMT_ENABLED, + /** Device powermanagement status is about to change. 
*/ + PMD_MGMT_BUSY +}; + +struct pmd_queue_cfg { + volatile enum pmd_mgmt_state pwr_mgmt_state; + /**< State of power management for this queue */ + enum rte_power_pmd_mgmt_type cb_mode; + /**< Callback mode for this queue */ + const struct rte_eth_rxtx_callback *cur_cb; + /**< Callback instance */ + volatile bool umwait_in_progress; + /**< are we currently sleeping? */ + uint64_t empty_poll_stats; + /**< Number of empty polls */ +} __rte_cache_aligned; + +static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT]; + +static void +calc_tsc(void) +{ + const uint64_t hz = rte_get_timer_hz(); + const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */ + + global_data.tsc_per_us = tsc_per_us; + + /* only do this if we don't have tpause */ + if (!global_data.intrinsics_support.power_pause) { + const uint64_t start = rte_rdtsc_precise(); + const uint32_t n_pauses = 10000; + double us, us_per_pause; + uint64_t end; + unsigned int i; + + /* estimate number of rte_pause() calls per us*/ + for (i = 0; i < n_pauses; i++) + rte_pause(); + + end = rte_rdtsc_precise(); + us = (end - start) / (double)tsc_per_us; + us_per_pause = us / n_pauses; + + global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause); + } +} + +static uint16_t +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, + uint16_t nb_rx, uint16_t max_pkts __rte_unused, + void *addr __rte_unused) +{ + + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + struct rte_power_monitor_cond pmc; + uint16_t ret; + + /* + * we might get a cancellation request while being + * inside the callback, in which case the wakeup + * wouldn't work because it would've arrived too early. + * + * to get around this, we notify the other thread that + * we're sleeping, so that it can spin until we're done. + * unsolicited wakeups are perfectly safe. 
+ */ + q_conf->umwait_in_progress = true; + + /* check if we need to cancel sleep */ + if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { + /* use monitoring condition to sleep */ + ret = rte_eth_get_monitor_addr(port_id, qidx, + &pmc); + if (ret == 0) + rte_power_monitor(&pmc, -1ULL); + } + q_conf->umwait_in_progress = false; + } + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, + uint16_t nb_rx, uint16_t max_pkts __rte_unused, + void *addr __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + /* sleep for 1 microsecond */ + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + /* use tpause if we have it */ + if (global_data.intrinsics_support.power_pause) { + const uint64_t cur = rte_rdtsc(); + const uint64_t wait_tsc = + cur + global_data.tsc_per_us; + rte_power_pause(wait_tsc); + } else { + uint64_t i; + for (i = 0; i < global_data.pause_per_us; i++) + rte_pause(); + } + } + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_scale_freq(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *_ __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) + /* scale down freq */ + rte_power_freq_min(rte_lcore_id()); + } else { + q_conf->empty_poll_stats = 0; + /* scale up freq */ + rte_power_freq_max(rte_lcore_id()); + } + + return nb_rx; +} + +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, + uint16_t queue_id, enum rte_power_pmd_mgmt_type mode) +{ + struct pmd_queue_cfg *queue_cfg; + struct rte_eth_dev_info info; + int ret; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 
-EINVAL); + + if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) { + ret = -EINVAL; + goto end; + } + + if (rte_eth_dev_info_get(port_id, &info) < 0) { + ret = -EINVAL; + goto end; + } + + /* check if queue id is valid */ + if (queue_id >= info.nb_rx_queues) { + ret = -EINVAL; + goto end; + } + + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) { + ret = -EINVAL; + goto end; + } + + /* we're about to change our state */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; + + /* we need this in various places */ + rte_cpu_get_intrinsics_support(&global_data.intrinsics_support); + + switch (mode) { + case RTE_POWER_MGMT_TYPE_MONITOR: + { + struct rte_power_monitor_cond dummy; + + /* check if rte_power_monitor is supported */ + if (!global_data.intrinsics_support.power_monitor) { + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); + ret = -ENOTSUP; + goto rollback; + } + + /* check if the device supports the necessary PMD API */ + if (rte_eth_get_monitor_addr(port_id, queue_id, + &dummy) == -ENOTSUP) { + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n"); + ret = -ENOTSUP; + goto rollback; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->umwait_in_progress = false; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_umwait, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_SCALE: + { + enum power_management_env env; + /* only PSTATE and ACPI modes are supported */ + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && + !rte_power_check_env_supported( + PM_ENV_PSTATE_CPUFREQ)) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); + ret = -ENOTSUP; + goto rollback; + } + /* ensure we could initialize the power library */ + if (rte_power_init(lcore_id)) { + ret = -EINVAL; + goto rollback; 
+ } + /* ensure we initialized the correct env */ + env = rte_power_get_env(); + if (env != PM_ENV_ACPI_CPUFREQ && + env != PM_ENV_PSTATE_CPUFREQ) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); + ret = -ENOTSUP; + goto rollback; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, + queue_id, clb_scale_freq, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_PAUSE: + /* figure out various time-to-tsc conversions */ + if (global_data.tsc_per_us == 0) + calc_tsc(); + + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_pause, NULL); + break; + } + ret = 0; + + return ret; + +rollback: + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; +end: + return ret; +} + +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id) +{ + struct pmd_queue_cfg *queue_cfg; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); + + if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) + return -EINVAL; + + /* no need to check queue id as wrong queue id would not be enabled */ + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) + return -EINVAL; + + /* let the callback know we're shutting down */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; + + switch (queue_cfg->cb_mode) { + case RTE_POWER_MGMT_TYPE_MONITOR: + { + bool exit = false; + do { + /* + * we may request cancellation while the other thread + * has just entered the callback but hasn't started + * sleeping yet, so keep waking it up until we know it's + * done sleeping. 
+ */ + if (queue_cfg->umwait_in_progress) + rte_power_monitor_wakeup(lcore_id); + else + exit = true; + } while (!exit); + } + /* fall-through */ + case RTE_POWER_MGMT_TYPE_PAUSE: + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + break; + case RTE_POWER_MGMT_TYPE_SCALE: + rte_power_freq_max(lcore_id); + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + rte_power_exit(lcore_id); + break; + } + /* + * we don't free the RX callback here because it is unsafe to do so + * unless we know for a fact that all data plane threads have stopped. + */ + queue_cfg->cur_cb = NULL; + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + + return 0; +} diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h new file mode 100644 index 0000000000..0bfbc6ba69 --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#ifndef _RTE_POWER_PMD_MGMT_H +#define _RTE_POWER_PMD_MGMT_H + +/** + * @file + * RTE PMD Power Management + */ +#include <stdint.h> +#include <stdbool.h> + +#include <rte_common.h> +#include <rte_byteorder.h> +#include <rte_log.h> +#include <rte_power.h> +#include <rte_atomic.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * PMD Power Management Type + */ +enum rte_power_pmd_mgmt_type { + /** Use power-optimized monitoring to wait for incoming traffic */ + RTE_POWER_MGMT_TYPE_MONITOR = 1, + /** Use power-optimized sleep to avoid busy polling */ + RTE_POWER_MGMT_TYPE_PAUSE, + /** Use frequency scaling when traffic is low */ + RTE_POWER_MGMT_TYPE_SCALE, +}; + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Enable power management on a specified RX queue and lcore. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. 
+ * @param queue_id + * The queue identifier of the Ethernet device. + * @param mode + * The power management callback function type. + + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id, + enum rte_power_pmd_mgmt_type mode); + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Disable power management on a specified RX queue and lcore. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The queue identifier of the Ethernet device. + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id); +#ifdef __cplusplus +} +#endif + +#endif diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map index 69ca9af616..61996b4d11 100644 --- a/lib/librte_power/version.map +++ b/lib/librte_power/version.map @@ -34,4 +34,9 @@ EXPERIMENTAL { rte_power_guest_channel_receive_msg; rte_power_poll_stat_fetch; rte_power_poll_stat_update; + + # added in 21.02 + rte_power_pmd_mgmt_queue_enable; + rte_power_pmd_mgmt_queue_disable; + }; -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
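The enable/disable pair above is guarded by a small per-queue state: the disable path rejects any queue that is not currently enabled, marks it busy while the callback is torn down, and only then marks it disabled. A minimal sketch of that state handling follows. The `mgmt_state`, `queue_state`, `queue_enable()` and `queue_disable()` names are hypothetical, and the check on the enable side is an assumption, since that part of the function is not quoted here.

```c
#include <errno.h>

/* hypothetical mirror of the PMD_MGMT_* states used in the patch */
enum mgmt_state { MGMT_DISABLED = 0, MGMT_BUSY, MGMT_ENABLED };

struct queue_state {
	enum mgmt_state state;
};

/* assumed behaviour: refuse to enable a queue that is not disabled */
static int
queue_enable(struct queue_state *q)
{
	if (q->state != MGMT_DISABLED)
		return -EINVAL;
	q->state = MGMT_ENABLED;
	return 0;
}

/* as in the quoted disable path: only an enabled queue can be disabled */
static int
queue_disable(struct queue_state *q)
{
	if (q->state != MGMT_ENABLED)
		return -EINVAL;
	/* let the callback know we're shutting down */
	q->state = MGMT_BUSY;
	/* a real implementation would wake the lcore and remove the callback here */
	q->state = MGMT_DISABLED;
	return 0;
}
```

Calling `queue_disable()` twice in a row returns `-EINVAL` the second time, which matches the "wrong queue id would not be enabled" comment in the patch.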
* [dpdk-dev] [PATCH v15 08/11] net/ixgbe: implement power management API 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (6 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 07/11] power: add PMD power management API and callback Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 09/11] net/i40e: " Anatoly Burakov ` (3 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Liang Ma <liang.j.ma@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ixgbe/ixgbe_ethdev.c | 1 + drivers/net/ixgbe/ixgbe_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/ixgbe/ixgbe_rxtx.h | 1 + 3 files changed, 27 insertions(+) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 9a47a8b262..4b7a5ca60b 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = { .udp_tunnel_port_del = ixgbe_dev_udp_tunnel_port_del, .tm_ops_get = ixgbe_tm_ops_get, .tx_done_cleanup = ixgbe_dev_tx_done_cleanup, + .get_monitor_addr = ixgbe_get_monitor_addr, }; /* diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index 6cfbb582e2..7e046a1819 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -1369,6 +1369,31 @@ const uint32_t RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP, }; +int +ixgbe_get_monitor_addr(void *rx_queue, 
struct rte_power_monitor_cond *pmc) +{ + volatile union ixgbe_adv_rx_desc *rxdp; + struct ixgbe_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.upper.status_error; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + + /* the registers are 32-bit */ + pmc->data_sz = sizeof(uint32_t); + + return 0; +} + /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */ static inline uint32_t ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask) diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h index 6d2f7c9da3..8a25e98df6 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.h +++ b/drivers/net/ixgbe/ixgbe_rxtx.h @@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev); +int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); #endif /* _IXGBE_RXTX_H_ */ -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v15 09/11] net/i40e: implement power management API 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (7 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 08/11] net/ixgbe: implement power management API Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 10/11] net/ice: " Anatoly Burakov ` (2 subsequent siblings) 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Acked-by: Jeff Guo <jia.guo@intel.com> --- drivers/net/i40e/i40e_ethdev.c | 1 + drivers/net/i40e/i40e_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/i40e/i40e_rxtx.h | 1 + 3 files changed, 27 insertions(+) diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c index f54769c29d..af2577a140 100644 --- a/drivers/net/i40e/i40e_ethdev.c +++ b/drivers/net/i40e/i40e_ethdev.c @@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = { .mtu_set = i40e_dev_mtu_set, .tm_ops_get = i40e_tm_ops_get, .tx_done_cleanup = i40e_tx_done_cleanup, + .get_monitor_addr = i40e_get_monitor_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index 5df9a9df56..0b4220fc9c 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -72,6 +72,31 @@ #define I40E_TX_OFFLOAD_NOTSUP_MASK \ (PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK) 
+int +i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + struct i40e_rx_queue *rxq = rx_queue; + volatile union i40e_rx_desc *rxdp; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.qword1.status_error_len; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + + /* registers are 64-bit */ + pmc->data_sz = sizeof(uint64_t); + + return 0; +} + static inline void i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp) { diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..e1494525ce 100644 --- a/drivers/net/i40e/i40e_rxtx.h +++ b/drivers/net/i40e/i40e_rxtx.h @@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); +int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); /* For each value it means, datasheet of hardware can tell more details * -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v15 10/11] net/ice: implement power management API 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (8 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 09/11] net/i40e: " Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ice/ice_ethdev.c | 1 + drivers/net/ice/ice_rxtx.c | 26 ++++++++++++++++++++++++++ drivers/net/ice/ice_rxtx.h | 1 + 3 files changed, 28 insertions(+) diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c index 9a5d6a559f..c21682c120 100644 --- a/drivers/net/ice/ice_ethdev.c +++ b/drivers/net/ice/ice_ethdev.c @@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = { .udp_tunnel_port_add = ice_dev_udp_tunnel_port_add, .udp_tunnel_port_del = ice_dev_udp_tunnel_port_del, .tx_done_cleanup = ice_tx_done_cleanup, + .get_monitor_addr = ice_get_monitor_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index 5fbd68eafc..fa9e9a235b 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -26,6 +26,32 @@ uint64_t 
rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask; uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask; uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask; +int +ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + volatile union ice_rx_flex_desc *rxdp; + struct ice_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.status_error0; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + + /* register is 16-bit */ + pmc->data_sz = sizeof(uint16_t); + + return 0; +} + + static inline uint8_t ice_proto_xtr_type_to_rxdid(uint8_t xtr_type) { diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h index 6b16716063..906fbefdc4 100644 --- a/drivers/net/ice/ice_rxtx.h +++ b/drivers/net/ice/ice_rxtx.h @@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc); int ice_tx_done_cleanup(void *txq, uint32_t free_cnt); +int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \ int i; \ -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
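Each of the three `get_monitor_addr` implementations above fills in the same four things: an address to watch, an expected value, a mask, and the width of the watched field (32-bit for ixgbe, 64-bit for i40e, 16-bit for ice). A minimal sketch of how a consumer might evaluate such a condition before going to sleep is below. `struct monitor_cond`, `read_sized()` and `cond_already_met()` are hypothetical stand-ins for illustration, not the DPDK definitions, and byte-order conversion is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical stand-in for struct rte_power_monitor_cond */
struct monitor_cond {
	const volatile void *addr; /* address to watch */
	uint64_t val;              /* expected value after masking */
	uint64_t mask;             /* bits of interest */
	uint8_t data_sz;           /* width of the watched field, in bytes */
};

/* read a 1-, 2-, 4- or 8-byte value; returns false on unsupported size */
static bool
read_sized(const volatile void *p, uint8_t sz, uint64_t *out)
{
	switch (sz) {
	case sizeof(uint8_t):
		*out = *(const volatile uint8_t *)p;
		return true;
	case sizeof(uint16_t):
		*out = *(const volatile uint16_t *)p;
		return true;
	case sizeof(uint32_t):
		*out = *(const volatile uint32_t *)p;
		return true;
	case sizeof(uint64_t):
		*out = *(const volatile uint64_t *)p;
		return true;
	default:
		return false;
	}
}

/* true if the masked value already matches, so sleeping would be pointless */
static bool
cond_already_met(const struct monitor_cond *c)
{
	uint64_t cur;

	if (!read_sized(c->addr, c->data_sz, &cur))
		return false;
	return (cur & c->mask) == c->val;
}
```

With a DD-style bit as the mask and value, the check stays false until the descriptor has been written back, at which point there is no reason to arm the monitor at all.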
* [dpdk-dev] [PATCH v15 11/11] examples/l3fwd-power: enable PMD power mgmt 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (9 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 10/11] net/ice: " Anatoly Burakov @ 2021-01-11 14:58 ` Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov 11 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-11 14:58 UTC (permalink / raw) To: dev Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev, timothy.mcdaniel, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Add PMD power management feature support to l3fwd-power sample app. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v12: - Allow selecting PMD power management scheme from command-line - Enforce 1 core 1 queue rule .../sample_app_ug/l3_forward_power_man.rst | 35 ++++++++ examples/l3fwd-power/main.c | 89 ++++++++++++++++++- 2 files changed, 122 insertions(+), 2 deletions(-) diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst index 85a78a5c1e..aaa9367fae 100644 --- a/doc/guides/sample_app_ug/l3_forward_power_man.rst +++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst @@ -109,6 +109,8 @@ where, * --telemetry: Telemetry mode. +* --pmd-mgmt: PMD power management mode. + See :doc:`l3_forward` for details. The L3fwd-power example reuses the L3fwd command line options. @@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set to either 0% or The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``. 
+ +PMD power management Mode +------------------------- + +PMD power management mode is a standalone mode of ``l3fwd-power``. In this mode, +``l3fwd-power`` does simple L3 forwarding while enabling the power saving scheme on a specific +port/queue/lcore. The main purpose of this mode is to demonstrate how to use the PMD power management API. + +.. code-block:: console + + ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt=monitor -p 0x0f --config="(0,0,2),(0,1,3)" + +PMD Power Management Mode +------------------------- +There is also a traffic-aware operating mode that, instead of using explicit +power management, will use automatic PMD power management. This mode is limited +to one queue per core, and has three available power management schemes: + +* ``monitor`` - this will use the ``rte_power_monitor()`` function to enter a + power-optimized state (subject to platform support). + +* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid + busy looping when there is no traffic. + +* ``scale`` - this will use frequency scaling routines available in the + ``librte_power`` library. + +See the :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK +Programmer's Guide for more details on PMD power management. + +.. 
code-block:: console + + ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c index 995a3b6ad7..e312b6f355 100644 --- a/examples/l3fwd-power/main.c +++ b/examples/l3fwd-power/main.c @@ -47,6 +47,7 @@ #include <rte_power_empty_poll.h> #include <rte_metrics.h> #include <rte_telemetry.h> +#include <rte_power_pmd_mgmt.h> #include "perf_core.h" #include "main.h" @@ -199,11 +200,14 @@ enum appmode { APP_MODE_LEGACY, APP_MODE_EMPTY_POLL, APP_MODE_TELEMETRY, - APP_MODE_INTERRUPT + APP_MODE_INTERRUPT, + APP_MODE_PMD_MGMT }; enum appmode app_mode; +static enum rte_power_pmd_mgmt_type pmgmt_type; + enum freq_scale_hint_t { FREQ_LOWER = -1, @@ -1611,7 +1615,9 @@ print_usage(const char *prgname) " follow (training_flag, high_threshold, med_threshold)\n" " --telemetry: enable telemetry mode, to update" " empty polls, full polls, and core busyness to telemetry\n" - " --interrupt-only: enable interrupt-only mode\n", + " --interrupt-only: enable interrupt-only mode\n" + " --pmd-mgmt MODE: enable PMD power management mode. 
" + "Currently supported modes: monitor, pause, scale\n", prgname); } @@ -1701,6 +1707,32 @@ parse_config(const char *q_arg) return 0; } + +static int +parse_pmd_mgmt_config(const char *name) +{ +#define PMD_MGMT_MONITOR "monitor" +#define PMD_MGMT_PAUSE "pause" +#define PMD_MGMT_SCALE "scale" + + if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR; + return 0; + } + + if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE; + return 0; + } + + if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE; + return 0; + } + /* unknown PMD power management mode */ + return -1; +} + static int parse_ep_config(const char *q_arg) { @@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg) #define CMD_LINE_OPT_EMPTY_POLL "empty-poll" #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only" #define CMD_LINE_OPT_TELEMETRY "telemetry" +#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt" /* Parse the argument given in the command line of the application */ static int @@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv) {CMD_LINE_OPT_LEGACY, 0, 0, 0}, {CMD_LINE_OPT_TELEMETRY, 0, 0, 0}, {CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0}, + {CMD_LINE_OPT_PMD_MGMT, 1, 0, 0}, {NULL, 0, 0, 0} }; @@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv) printf("telemetry mode is enabled\n"); } + if (!strncmp(lgopts[option_index].name, + CMD_LINE_OPT_PMD_MGMT, + sizeof(CMD_LINE_OPT_PMD_MGMT))) { + if (app_mode != APP_MODE_DEFAULT) { + printf(" power mgmt mode is mutually exclusive with other modes\n"); + return -1; + } + if (parse_pmd_mgmt_config(optarg) < 0) { + printf(" Invalid PMD power management mode: %s\n", + optarg); + return -1; + } + app_mode = APP_MODE_PMD_MGMT; + printf("PMD power mgmt mode is enabled\n"); + } if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_INTERRUPT_ONLY, sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) { @@ -2442,6 +2491,8 @@ 
mode_to_str(enum appmode mode) return "telemetry"; case APP_MODE_INTERRUPT: return "interrupt-only"; + case APP_MODE_PMD_MGMT: + return "pmd mgmt"; default: return "invalid"; } @@ -2671,6 +2722,13 @@ main(int argc, char **argv) qconf = &lcore_conf[lcore_id]; printf("\nInitializing rx queues on lcore %u ... ", lcore_id ); fflush(stdout); + + /* PMD power management mode can only do 1 queue per core */ + if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) { + rte_exit(EXIT_FAILURE, + "In PMD power management mode, only one queue per lcore is allowed\n"); + } + /* init RX queues */ for(queue = 0; queue < qconf->n_rx_queue; ++queue) { struct rte_eth_rxconf rxq_conf; @@ -2708,6 +2766,16 @@ main(int argc, char **argv) rte_exit(EXIT_FAILURE, "Fail to add ptype cb\n"); } + + if (app_mode == APP_MODE_PMD_MGMT) { + ret = rte_power_pmd_mgmt_queue_enable( + lcore_id, portid, queueid, + pmgmt_type); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n", + ret, portid); + } } } @@ -2798,6 +2866,9 @@ main(int argc, char **argv) SKIP_MAIN); } else if (app_mode == APP_MODE_INTERRUPT) { rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN); + } else if (app_mode == APP_MODE_PMD_MGMT) { + /* reuse telemetry loop for PMD power management mode */ + rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN); } if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY) @@ -2824,6 +2895,20 @@ main(int argc, char **argv) if (app_mode == APP_MODE_EMPTY_POLL) rte_power_empty_poll_stat_free(); + if (app_mode == APP_MODE_PMD_MGMT) { + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + qconf = &lcore_conf[lcore_id]; + for (queue = 0; queue < qconf->n_rx_queue; ++queue) { + portid = qconf->rx_queue_list[queue].port_id; + queueid = qconf->rx_queue_list[queue].queue_id; + rte_power_pmd_mgmt_queue_disable(lcore_id, + portid, queueid); + } + } + } + if ((app_mode == 
APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) && deinit_power_library()) rte_exit(EXIT_FAILURE, "deinit_power_library failed\n"); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
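The `--pmd-mgmt` option above maps a mode name (`monitor`, `pause`, `scale`) to a power management type with three separate string comparisons. The same mapping can be sketched as a lookup table, which keeps the accepted names in one place. The `pmd_mode` enum, the `pmd_modes` table and `parse_mode()` are hypothetical names for this illustration and are not part of l3fwd-power.

```c
#include <stddef.h>
#include <string.h>

/* hypothetical stand-in for enum rte_power_pmd_mgmt_type */
enum pmd_mode {
	PMD_MODE_MONITOR = 1,
	PMD_MODE_PAUSE,
	PMD_MODE_SCALE,
};

/* accepted command-line names and the mode each one selects */
static const struct {
	const char *name;
	enum pmd_mode mode;
} pmd_modes[] = {
	{ "monitor", PMD_MODE_MONITOR },
	{ "pause", PMD_MODE_PAUSE },
	{ "scale", PMD_MODE_SCALE },
};

/* returns 0 and stores the mode on an exact name match, -1 otherwise */
static int
parse_mode(const char *name, enum pmd_mode *out)
{
	size_t i;

	if (name == NULL || out == NULL)
		return -1;

	for (i = 0; i < sizeof(pmd_modes) / sizeof(pmd_modes[0]); i++) {
		if (strcmp(pmd_modes[i].name, name) == 0) {
			*out = pmd_modes[i].mode;
			return 0;
		}
	}
	/* unknown PMD power management mode */
	return -1;
}
```

Like the `strncmp()` calls with `sizeof()` in the patch, `strcmp()` here only accepts an exact match, so a prefix such as `mon` or a longer string such as `scaled` is rejected.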
* [dpdk-dev] [PATCH v16 00/11] Add PMD power management 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 00/11] Add PMD power management Anatoly Burakov ` (10 preceding siblings ...) 2021-01-11 14:58 ` [dpdk-dev] [PATCH v15 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics Anatoly Burakov ` (12 more replies) 11 siblings, 13 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the
CPU to enter a power-optimized state while waiting for packets to
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packets. The latter is achieved through cooperation
with the NIC driver, which lets us know the address of the wake-up
event and wait for writes on that address.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
are used in their raw opcode form because there is no widespread
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from
within a Rx callback inside the PMD. Once there's traffic again, the
empty poll counter is reset.

This patchset also introduces a few changes into existing power
management-related intrinsics, namely to provide a native way of waking
up a sleeping core without the application being responsible for it, as
well as general robustness improvements.

There's quite a bit of locking going on, but these locks are per-thread
and very little (if any) contention is expected, so the performance
impact shouldn't be that bad (and in any case the locking happens when
we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows
the application to just flip a switch and automatically have power
savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore
  must handle RX on at most a single queue
- Three policy types are supported: monitor, pause, and frequency scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v16:
- Implemented Konstantin's suggestions and comments
- Added return values to the API

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle
  invalid parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst | 44 +++
 doc/guides/rel_notes/release_21_02.rst | 15 +
 .../sample_app_ug/l3_forward_power_man.rst | 35 ++
 drivers/event/dlb/dlb.c | 10 +-
 drivers/event/dlb2/dlb2.c | 10 +-
 drivers/net/i40e/i40e_ethdev.c | 1 +
 drivers/net/i40e/i40e_rxtx.c | 25 ++
 drivers/net/i40e/i40e_rxtx.h | 1 +
 drivers/net/ice/ice_ethdev.c | 1 +
 drivers/net/ice/ice_rxtx.c | 26 ++
 drivers/net/ice/ice_rxtx.h | 1 +
 drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
 drivers/net/ixgbe/ixgbe_rxtx.c | 25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h | 1 +
 examples/l3fwd-power/main.c | 89 ++++-
 .../arm/include/rte_power_intrinsics.h | 40 --
 lib/librte_eal/arm/meson.build | 1 +
 lib/librte_eal/arm/rte_power_intrinsics.c | 38 ++
 .../include/generic/rte_power_intrinsics.h | 88 ++---
 .../ppc/include/rte_power_intrinsics.h | 40 --
 lib/librte_eal/ppc/meson.build | 1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c | 38 ++
 lib/librte_eal/version.map | 3 +
 .../x86/include/rte_power_intrinsics.h | 115 ------
 lib/librte_eal/x86/meson.build | 1 +
 lib/librte_eal/x86/rte_power_intrinsics.c | 220 +++++++++++
 lib/librte_ethdev/rte_ethdev.c | 28 ++
 lib/librte_ethdev/rte_ethdev.h | 25 ++
 lib/librte_ethdev/rte_ethdev_driver.h | 22 ++
 lib/librte_ethdev/version.map | 3 +
 lib/librte_power/meson.build | 5 +-
 lib/librte_power/rte_power_pmd_mgmt.c | 359 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++
 lib/librte_power/version.map | 5 +
 34 files changed, 1148 insertions(+), 259 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

--
2.25.1

^ permalink raw reply [flat|nested] 421+ messages in thread
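The empty-poll mechanism described in the cover letter (count polls that return nothing, act once a threshold is crossed, reset as soon as traffic shows up) can be sketched in a few lines. This is only an illustration under stated assumptions: `EMPTY_POLL_THRESHOLD`, `struct poll_state` and `poll_update()` are hypothetical names, and the threshold value is made up rather than taken from the patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* assumed value, for illustration only */
#define EMPTY_POLL_THRESHOLD 512

/* hypothetical per-queue bookkeeping */
struct poll_state {
	uint64_t empty_polls; /* consecutive polls that returned no packets */
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns true when the caller should apply its power saving scheme.
 */
static bool
poll_update(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic is back: start counting from scratch */
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

In the patchset this decision is made from within an RX callback, so the application's polling loop does not need to change. What happens once the function returns true (monitor, pause, or frequency scaling) depends on the selected scheme.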
* [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 02/11] eal: avoid invalid API usage in " Anatoly Burakov ` (11 subsequent siblings) 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, power intrinsics are inline functions. Make them part of the ABI so that we can have various internal data associated with them without exposing said data to the outside world. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- Notes: v14: - Fix compile issues on ARM and PPC64 by moving implementations to .c files .../arm/include/rte_power_intrinsics.h | 40 ------ lib/librte_eal/arm/meson.build | 1 + lib/librte_eal/arm/rte_power_intrinsics.c | 45 +++++++ .../include/generic/rte_power_intrinsics.h | 6 +- .../ppc/include/rte_power_intrinsics.h | 40 ------ lib/librte_eal/ppc/meson.build | 1 + lib/librte_eal/ppc/rte_power_intrinsics.c | 45 +++++++ lib/librte_eal/version.map | 3 + .../x86/include/rte_power_intrinsics.h | 115 ----------------- lib/librte_eal/x86/meson.build | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 120 ++++++++++++++++++ 11 files changed, 219 insertions(+), 198 deletions(-) create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index a4a1bc1159..9e498e9ebf 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ 
b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -13,46 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build index d62875ebae..6ec53ea03a 100644 --- a/lib/librte_eal/arm/meson.build +++ b/lib/librte_eal/arm/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c new file mode 100644 index 0000000000..ab1f44f611 --- /dev/null +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -0,0 +1,45 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2021 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +/** + * This function is not supported on ARM. 
+ */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on ARM. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(lck); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on ARM. + */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index dd520d90fa..67977bd511 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -52,7 +52,7 @@ * to undefined result. */ __rte_experimental -static inline void rte_power_monitor(const volatile void *p, +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz); @@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p, * wakes up. */ __rte_experimental -static inline void rte_power_monitor_sync(const volatile void *p, +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck); @@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p, * architecture-dependent. 
*/ __rte_experimental -static inline void rte_power_pause(const uint64_t tsc_timestamp); +void rte_power_pause(const uint64_t tsc_timestamp); #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h index 4ed03d521f..c0e9ac279f 100644 --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h @@ -13,46 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on PPC64. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on PPC64. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on PPC64. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/ppc/meson.build b/lib/librte_eal/ppc/meson.build index f4b6d95c42..43c46542fb 100644 --- a/lib/librte_eal/ppc/meson.build +++ b/lib/librte_eal/ppc/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c new file mode 100644 index 0000000000..84340ca2a4 --- /dev/null +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -0,0 +1,45 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2021 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +/** + * This function is not supported on PPC64. + */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on PPC64. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(lck); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on PPC64. 
+ */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index b1db7ec795..32eceb8869 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -405,6 +405,9 @@ EXPERIMENTAL { rte_vect_set_max_simd_bitwidth; # added in 21.02 + rte_power_monitor; + rte_power_monitor_sync; + rte_power_pause; rte_thread_tls_key_create; rte_thread_tls_key_delete; rte_thread_tls_value_get; diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h index c7d790c854..e4c2b87f73 100644 --- a/lib/librte_eal/x86/include/rte_power_intrinsics.h +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h @@ -13,121 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -static inline uint64_t -__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz) -{ - switch (sz) { - case sizeof(uint8_t): - return *(const volatile uint8_t *)p; - case sizeof(uint16_t): - return *(const volatile uint16_t *)p; - case sizeof(uint32_t): - return *(const volatile uint32_t *)p; - case sizeof(uint64_t): - return *(const volatile uint64_t *)p; - default: - /* this is an intrinsic, so we can't have any error handling */ - RTE_ASSERT(0); - return 0; - } -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. 
- */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); -} - -/** - * This function uses TPAUSE instruction and will enter C0.2 state. For more - * information about usage of this instruction, please refer to Intel(R) 64 and - * IA-32 Architectures Software Developer's Manual. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* execute TPAUSE */ - asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build index e78f29002e..dfd42dee0c 100644 --- a/lib/librte_eal/x86/meson.build +++ b/lib/librte_eal/x86/meson.build @@ -8,4 +8,5 @@ sources += files( 'rte_cycles.c', 'rte_hypervisor.c', 'rte_spinlock.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c new file mode 100644 index 0000000000..34c5fd9c3e --- /dev/null +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -0,0 +1,120 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +static inline uint64_t +__get_umwait_val(const volatile void *p, const uint8_t sz) +{ + switch (sz) { + case sizeof(uint8_t): + return *(const volatile uint8_t *)p; + case sizeof(uint16_t): + return *(const volatile uint16_t *)p; + case sizeof(uint32_t): + return *(const volatile uint32_t *)p; + case sizeof(uint64_t): + return *(const volatile uint64_t *)p; + default: + /* this is an intrinsic, so we can't have any error handling */ + RTE_ASSERT(0); + return 0; + } +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
+ */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. + */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. 
+ */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + rte_spinlock_unlock(lck); + + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); + + rte_spinlock_lock(lck); +} + +/** + * This function uses TPAUSE instruction and will enter C0.2 state. For more + * information about usage of this instruction, please refer to Intel(R) 64 and + * IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* execute TPAUSE */ + asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
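For readers following the uninlined x86 implementation above, here is a minimal standalone sketch of the two pieces of plain C that surround the raw opcodes: the size-dispatched read of the monitored location, and the split of the 64-bit TSC deadline into the two 32-bit halves that the patch passes in EAX and EDX. The `sketch_*` names are hypothetical and not part of DPDK, and plain `assert()` stands in for `RTE_ASSERT()`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper in the spirit of __get_umwait_val(): read a
 * 1-, 2-, 4- or 8-byte value from p and widen it to 64 bits.
 */
uint64_t
sketch_read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)p;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)p;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)p;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)p;
	default:
		/* unsupported size: no error path at this point in the series */
		assert(0);
		return 0;
	}
}

/*
 * Hypothetical helper: split a 64-bit TSC deadline into low and high
 * 32-bit halves, matching the tsc_l/tsc_h locals in the patch.
 */
void
sketch_split_tsc(uint64_t tsc_timestamp, uint32_t *tsc_l, uint32_t *tsc_h)
{
	*tsc_l = (uint32_t)tsc_timestamp;
	*tsc_h = (uint32_t)(tsc_timestamp >> 32);
}
```

The sketch deliberately leaves out the inline assembly, so it builds on any architecture; it only illustrates the data handling, not the wait itself.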
* [dpdk-dev] [PATCH v16 02/11] eal: avoid invalid API usage in power intrinsics 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 03/11] eal: change API of " Anatoly Burakov ` (10 subsequent siblings) 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the API documentation mandates that if the user wants to use the power management intrinsics, they need to call the `rte_cpu_get_intrinsics_support` API and check support for specific intrinsics. However, if the user does not do that, it is possible to get an illegal instruction error because we're using raw instruction opcodes, which may or may not be supported at runtime. Now that we have everything in a C file, we can check for support at startup and prevent the user from encountering illegal instruction errors. We also add return values to the APIs, because why not. 
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- Notes: v16: - Add return values and proper error handling to the API v15: - Remove accidental whitespace changes v14: - Replace uint8_t with bool lib/librte_eal/arm/rte_power_intrinsics.c | 12 +++- .../include/generic/rte_power_intrinsics.h | 24 +++++-- lib/librte_eal/ppc/rte_power_intrinsics.c | 12 +++- lib/librte_eal/x86/rte_power_intrinsics.c | 64 +++++++++++++++++-- 4 files changed, 94 insertions(+), 18 deletions(-) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index ab1f44f611..7e7552fa8a 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -7,7 +7,7 @@ /** * This function is not supported on ARM. */ -void +int rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz) @@ -17,12 +17,14 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, RTE_SET_USED(value_mask); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(data_sz); + + return -ENOTSUP; } /** * This function is not supported on ARM. */ -void +int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck) @@ -33,13 +35,17 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); RTE_SET_USED(data_sz); + + return -ENOTSUP; } /** * This function is not supported on ARM. 
*/ -void +int rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); + + return -ENOTSUP; } diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 67977bd511..37e4ec0414 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -34,7 +34,6 @@ * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. @@ -50,9 +49,14 @@ * Data size (in bytes) that will be used to compare expected value with the * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead * to undefined result. + * + * @return + * 0 on success + * -EINVAL on invalid parameters + * -ENOTSUP if unsupported */ __rte_experimental -void rte_power_monitor(const volatile void *p, +int rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz); @@ -75,7 +79,6 @@ void rte_power_monitor(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. @@ -95,9 +98,14 @@ void rte_power_monitor(const volatile void *p, * A spinlock that must be locked before entering the function, will be * unlocked while the CPU is sleeping, and will be locked again once the CPU * wakes up. 
+ * + * @return + * 0 on success + * -EINVAL on invalid parameters + * -ENOTSUP if unsupported */ __rte_experimental -void rte_power_monitor_sync(const volatile void *p, +int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck); @@ -111,13 +119,17 @@ void rte_power_monitor_sync(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. + * + * @return + * 0 on success + * -EINVAL on invalid parameters + * -ENOTSUP if unsupported */ __rte_experimental -void rte_power_pause(const uint64_t tsc_timestamp); +int rte_power_pause(const uint64_t tsc_timestamp); #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 84340ca2a4..929e0611b0 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -7,7 +7,7 @@ /** * This function is not supported on PPC64. */ -void +int rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz) @@ -17,12 +17,14 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, RTE_SET_USED(value_mask); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(data_sz); + + return -ENOTSUP; } /** * This function is not supported on PPC64. 
*/ -void +int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck) @@ -33,13 +35,17 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); RTE_SET_USED(data_sz); + + return -ENOTSUP; } /** * This function is not supported on PPC64. */ -void +int rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); + + return -ENOTSUP; } diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 34c5fd9c3e..2a38440bec 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -4,6 +4,8 @@ #include "rte_power_intrinsics.h" +static bool wait_supported; + static inline uint64_t __get_umwait_val(const volatile void *p, const uint8_t sz) { @@ -17,24 +19,47 @@ __get_umwait_val(const volatile void *p, const uint8_t sz) case sizeof(uint64_t): return *(const volatile uint64_t *)p; default: - /* this is an intrinsic, so we can't have any error handling */ + /* shouldn't happen */ RTE_ASSERT(0); return 0; } } +static inline int +__check_val_size(const uint8_t sz) +{ + switch (sz) { + case sizeof(uint8_t): /* fall-through */ + case sizeof(uint16_t): /* fall-through */ + case sizeof(uint32_t): /* fall-through */ + case sizeof(uint64_t): /* fall-through */ + return 0; + default: + /* unexpected size */ + return -1; + } +} + /** * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. * For more information about usage of these instructions, please refer to * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
*/ -void +int rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return -ENOTSUP; + + if (__check_val_size(data_sz) < 0) + return -EINVAL; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. @@ -51,13 +76,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /* if the masked value is already matching, abort */ if (masked == expected_value) - return; + return 0; } /* execute UMWAIT */ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); + + return 0; } /** @@ -65,13 +92,21 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * For more information about usage of these instructions, please refer to * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ -void +int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return -ENOTSUP; + + if (__check_val_size(data_sz) < 0) + return -EINVAL; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. 
@@ -88,7 +123,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /* if the masked value is already matching, abort */ if (masked == expected_value) - return; + return 0; } rte_spinlock_unlock(lck); @@ -99,6 +134,8 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, "a"(tsc_l), "d"(tsc_h)); rte_spinlock_lock(lck); + + return 0; } /** @@ -106,15 +143,30 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, * information about usage of this instruction, please refer to Intel(R) 64 and * IA-32 Architectures Software Developer's Manual. */ -void +int rte_power_pause(const uint64_t tsc_timestamp) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return -ENOTSUP; + /* execute TPAUSE */ asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); + + return 0; +} + +RTE_INIT(rte_power_intrinsics_init) { + struct rte_cpu_intrinsics i; + + rte_cpu_get_intrinsics_support(&i); + + if (i.power_monitor && i.power_pause) + wait_supported = 1; } -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
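The checks this patch adds run in a fixed order: support is tested first, then the data size, and only then would the instructions execute. A minimal standalone sketch of that ordering follows. It assumes a plain setter in place of the `RTE_INIT` constructor and `rte_cpu_get_intrinsics_support()`, and every `sketch_*` name is hypothetical rather than part of DPDK.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the file-scope wait_supported flag in the patch. */
static bool sketch_wait_supported;

/*
 * Stand-in for the RTE_INIT constructor: the patch sets the flag only
 * when both power_monitor and power_pause are reported as supported.
 */
void
sketch_set_support(bool power_monitor, bool power_pause)
{
	sketch_wait_supported = power_monitor && power_pause;
}

/* Mirrors __check_val_size(): 0 for sizes 1, 2, 4 and 8, -1 otherwise. */
int
sketch_check_val_size(uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
	case sizeof(uint16_t):
	case sizeof(uint32_t):
	case sizeof(uint64_t):
		return 0;
	default:
		return -1;
	}
}

/*
 * The validation prologue of rte_power_monitor() without the
 * UMONITOR/UMWAIT part: support first, then parameters.
 */
int
sketch_power_monitor_precheck(uint8_t data_sz)
{
	if (!sketch_wait_supported)
		return -ENOTSUP;
	if (sketch_check_val_size(data_sz) < 0)
		return -EINVAL;
	return 0;
}
```

Because support is checked first, a caller on an unsupported CPU sees `-ENOTSUP` even when the size argument is also invalid, so it can fall back to ordinary polling without inspecting its parameters further.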
* [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 01/11] eal: uninline power intrinsics Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 02/11] eal: avoid invalid API usage in " Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-13 13:01 ` Ananyev, Konstantin 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 04/11] eal: remove sync version of power monitor Anatoly Burakov ` (9 subsequent siblings) 12 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Timothy McDaniel, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Bruce Richardson, Konstantin Ananyev, thomas, david.hunt, chris.macnamara Instead of passing around pointers and integers, collect everything into struct. This makes API design around these intrinsics much easier. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- Notes: v16: - Add error handling drivers/event/dlb/dlb.c | 10 ++-- drivers/event/dlb2/dlb2.c | 10 ++-- lib/librte_eal/arm/rte_power_intrinsics.c | 20 +++----- .../include/generic/rte_power_intrinsics.h | 50 ++++++++----------- lib/librte_eal/ppc/rte_power_intrinsics.c | 20 +++----- lib/librte_eal/x86/rte_power_intrinsics.c | 42 +++++++++------- 6 files changed, 70 insertions(+), 82 deletions(-) diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c index 0c95c4793d..d2f2026291 100644 --- a/drivers/event/dlb/dlb.c +++ b/drivers/event/dlb/dlb.c @@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, /* Interrupts not supported by PF PMD */ return 1; } else if (dlb->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, else expected_value = 0; - 
rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c index 86724863f2..c9a8a02278 100644 --- a/drivers/event/dlb2/dlb2.c +++ b/drivers/event/dlb2/dlb2.c @@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, if (elapsed_ticks >= timeout) { return 1; } else if (dlb2->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb2_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, else expected_value = 0; - rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index 7e7552fa8a..5f1caaf25b 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -8,15 +8,11 @@ * This function is not supported on ARM. 
*/ int -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); return -ENOTSUP; } @@ -25,16 +21,12 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * This function is not supported on ARM. */ int -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); return -ENOTSUP; } diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 37e4ec0414..3ad53068d5 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -18,6 +18,18 @@ * which are architecture-dependent. */ +struct rte_power_monitor_cond { + volatile void *addr; /**< Address to monitor for changes */ + uint64_t val; /**< Before attempting the monitoring, the address + * may be read and compared against this value. + **/ + uint64_t mask; /**< 64-bit mask to extract current value from addr */ + uint8_t data_sz; /**< Data size (in bytes) that will be used to compare + * expected value with the memory address. Can be 1, + * 2, 4, or 8. Supplying any other value will lead to + * undefined result. 
*/ +}; + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice @@ -35,20 +47,11 @@ * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. - * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. * * @return * 0 on success @@ -56,10 +59,8 @@ * -ENOTSUP if unsupported */ __rte_experimental -int rte_power_monitor(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz); - +int rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp); /** * @warning * @b EXPERIMENTAL: this API may change without prior notice @@ -80,20 +81,11 @@ int rte_power_monitor(const volatile void *p, * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. - * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. 
* @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. * @param lck * A spinlock that must be locked before entering the function, will be * unlocked while the CPU is sleeping, and will be locked again once the CPU @@ -105,10 +97,8 @@ int rte_power_monitor(const volatile void *p, * -ENOTSUP if unsupported */ __rte_experimental -int rte_power_monitor_sync(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz, - rte_spinlock_t *lck); +int rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck); /** * @warning diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 929e0611b0..5e5a1fff5a 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -8,15 +8,11 @@ * This function is not supported on PPC64. */ int -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); return -ENOTSUP; } @@ -25,16 +21,12 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * This function is not supported on PPC64. 
*/ int -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); return -ENOTSUP; } diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 2a38440bec..6be5c8b9f1 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -46,9 +46,8 @@ __check_val_size(const uint8_t sz) * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ int -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -57,7 +56,10 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, if (!wait_supported) return -ENOTSUP; - if (__check_val_size(data_sz) < 0) + if (pmc == NULL) + return -EINVAL; + + if (__check_val_size(pmc->data_sz) < 0) return -EINVAL; /* @@ -68,14 +70,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already 
matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return 0; } /* execute UMWAIT */ @@ -93,9 +96,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ int -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -104,7 +106,10 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, if (!wait_supported) return -ENOTSUP; - if (__check_val_size(data_sz) < 0) + if (pmc == NULL || lck == NULL) + return -EINVAL; + + if (__check_val_size(pmc->data_sz) < 0) return -EINVAL; /* @@ -115,14 +120,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return 0; } rte_spinlock_unlock(lck); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
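The early-exit check that the patch above performs between UMONITOR and UMWAIT can be sketched in portable C. This is a minimal illustration and not the DPDK implementation: `struct monitor_cond`, `read_sized()` and `monitor_should_skip_sleep()` are hypothetical names, and only the field names (`addr`, `val`, `mask`, `data_sz`) are taken from the patch. The sketch validates the condition the same way the patch does, then reports whether the masked value already matches, in which case the caller would return without sleeping.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_power_monitor_cond. The field names
 * follow the patch; the exact layout is an assumption. */
struct monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* expected value after masking */
	uint64_t mask;       /* 0 disables the comparison */
	uint8_t data_sz;     /* 1, 2, 4 or 8 bytes */
};

/* Read a value of the given width; the caller has already validated sz. */
static uint64_t
read_sized(const volatile void *p, uint8_t sz)
{
	switch (sz) {
	case 1:
		return *(const volatile uint8_t *)p;
	case 2:
		return *(const volatile uint16_t *)p;
	case 4:
		return *(const volatile uint32_t *)p;
	default:
		return *(const volatile uint64_t *)p;
	}
}

/*
 * Returns 1 if the masked value already matches (skip the sleep),
 * 0 if the caller should go on to sleep, -EINVAL on bad parameters.
 */
static int
monitor_should_skip_sleep(const struct monitor_cond *c)
{
	if (c == NULL)
		return -EINVAL;
	if (c->data_sz != 1 && c->data_sz != 2 &&
			c->data_sz != 4 && c->data_sz != 8)
		return -EINVAL;

	/* a zero mask means "no comparison requested" */
	if (c->mask == 0)
		return 0;

	return (read_sized(c->addr, c->data_sz) & c->mask) == c->val;
}
```

Collecting the four values in one structure lets a driver fill in the condition once and hand it over as a single pointer, which is the motivation given in the commit message.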
* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 03/11] eal: change API of " Anatoly Burakov @ 2021-01-13 13:01 ` Ananyev, Konstantin 2021-01-13 17:22 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-13 13:01 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Richardson, Bruce, thomas, Hunt, David, Macnamara, Chris > > Instead of passing around pointers and integers, collect everything > into struct. This makes API design around these intrinsics much easier. > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > --- > > Notes: > v16: > - Add error handling There are few trivial checkpatch warnings, please check ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics 2021-01-13 13:01 ` Ananyev, Konstantin @ 2021-01-13 17:22 ` Burakov, Anatoly 2021-01-13 18:01 ` Ananyev, Konstantin 0 siblings, 1 reply; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-13 17:22 UTC (permalink / raw) To: Ananyev, Konstantin, dev Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Richardson, Bruce, thomas, Hunt, David, Macnamara, Chris On 13-Jan-21 1:01 PM, Ananyev, Konstantin wrote: > >> >> Instead of passing around pointers and integers, collect everything >> into struct. This makes API design around these intrinsics much easier. >> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> >> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> >> --- >> >> Notes: >> v16: >> - Add error handling > > There are few trivial checkpatch warnings, please check > To paraphrase Nick Fury, I recognize that checkpatch has produced warnings, but given that i don't agree with what checkpatch has to say in this case, I've elected to ignore it :) In particular, these warnings related to comments around struct members, which i think i've made to look nice and also took care of correct indentation in terms of code looking the same way with different tab widths. So, i don't think it should be changed, unless you're suggesting to re-layout comments on top of each member, rather than at the side (which i think is more readable). -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics 2021-01-13 17:22 ` Burakov, Anatoly @ 2021-01-13 18:01 ` Ananyev, Konstantin 2021-01-14 10:23 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-13 18:01 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Richardson, Bruce, thomas, Hunt, David, Macnamara, Chris > On 13-Jan-21 1:01 PM, Ananyev, Konstantin wrote: > > > >> > >> Instead of passing around pointers and integers, collect everything > >> into struct. This makes API design around these intrinsics much easier. > >> > >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > >> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > >> --- > >> > >> Notes: > >> v16: > >> - Add error handling > > > > There are few trivial checkpatch warnings, please check > > > > To paraphrase Nick Fury, I recognize that checkpatch has produced > warnings, but given that i don't agree with what checkpatch has to say > in this case, I've elected to ignore it :) > > In particular, these warnings related to comments around struct members, > which i think i've made to look nice and also took care of correct > indentation in terms of code looking the same way with different tab > widths. So, i don't think it should be changed, unless you're suggesting > to re-layout comments on top of each member, rather than at the side > (which i think is more readable). If top is not an option, it is possible to move the comment to the lines that follow the field itself: uint32_t x; /**< * blah, blah * blah, blah, blah */ AFAIK that would keep checkpatch happy. ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics 2021-01-13 18:01 ` Ananyev, Konstantin @ 2021-01-14 10:23 ` Burakov, Anatoly 2021-01-14 12:33 ` Ananyev, Konstantin 0 siblings, 1 reply; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-14 10:23 UTC (permalink / raw) To: Ananyev, Konstantin, dev Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Richardson, Bruce, thomas, Hunt, David, Macnamara, Chris On 13-Jan-21 6:01 PM, Ananyev, Konstantin wrote: > >> On 13-Jan-21 1:01 PM, Ananyev, Konstantin wrote: >>> >>>> >>>> Instead of passing around pointers and integers, collect everything >>>> into struct. This makes API design around these intrinsics much easier. >>>> >>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> >>>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> >>>> --- >>>> >>>> Notes: >>>> v16: >>>> - Add error handling >>> >>> There are few trivial checkpatch warnings, please check >>> >> >> To paraphrase Nick Fury, I recognize that checkpatch has produced >> warnings, but given that i don't agree with what checkpatch has to say >> in this case, I've elected to ignore it :) >> >> In particular, these warnings related to comments around struct members, >> which i think i've made to look nice and also took care of correct >> indentation in terms of code looking the same way with different tab >> widths. So, i don't think it should be changed, unless you're suggesting >> to re-layout comments on top of each member, rather than at the side >> (which i think is more readable). > > If top is not an option, it is possible to move comment on next after actual field lines: > uint32_t x; > /**< > * blah, blah > * blah, blah, blah > */ > AFAIK that would keep checkpatch happy. > It's not as much "not an option" as it would look less readable to me than what there currently is. If we're going to keep comments not on the side, then on the top they go. 
I'd prefer to keep it as is, but if you feel strongly about it, i can change it. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v16 03/11] eal: change API of power intrinsics 2021-01-14 10:23 ` Burakov, Anatoly @ 2021-01-14 12:33 ` Ananyev, Konstantin 0 siblings, 0 replies; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-14 12:33 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: McDaniel, Timothy, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Richardson, Bruce, thomas, Hunt, David, Macnamara, Chris > > On 13-Jan-21 6:01 PM, Ananyev, Konstantin wrote: > > > >> On 13-Jan-21 1:01 PM, Ananyev, Konstantin wrote: > >>> > >>>> > >>>> Instead of passing around pointers and integers, collect everything > >>>> into struct. This makes API design around these intrinsics much easier. > >>>> > >>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > >>>> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > >>>> --- > >>>> > >>>> Notes: > >>>> v16: > >>>> - Add error handling > >>> > >>> There are few trivial checkpatch warnings, please check > >>> > >> > >> To paraphrase Nick Fury, I recognize that checkpatch has produced > >> warnings, but given that i don't agree with what checkpatch has to say > >> in this case, I've elected to ignore it :) > >> > >> In particular, these warnings related to comments around struct members, > >> which i think i've made to look nice and also took care of correct > >> indentation in terms of code looking the same way with different tab > >> widths. So, i don't think it should be changed, unless you're suggesting > >> to re-layout comments on top of each member, rather than at the side > >> (which i think is more readable). > > > > If top is not an option, it is possible to move comment on next after actual field lines: > > uint32_t x; > > /**< > > * blah, blah > > * blah, blah, blah > > */ > > AFAIK that would keep checkpatch happy. > > > > It's not as much "not an option" as it would look less readable to me > than what there currently is. 
If we're going to keep comments not on the > side, then on the top they go. I'd prefer to keep it as is, but if you > feel strongly about it, i can change it. I don't have any preference about comment placement. I just thought it would be good to keep checkpatch happy, especially if the changes in question are just cosmetic ones. ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v16 04/11] eal: remove sync version of power monitor 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (2 preceding siblings ...) 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 03/11] eal: change API of " Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function Anatoly Burakov ` (8 subsequent siblings) 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the "sync" version of power monitor intrinsic is supposed to be used for purposes of waking up a sleeping core. However, there are better ways to achieve the same result, so remove the unneeded function. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- lib/librte_eal/arm/rte_power_intrinsics.c | 14 ----- .../include/generic/rte_power_intrinsics.h | 38 ------------- lib/librte_eal/ppc/rte_power_intrinsics.c | 14 ----- lib/librte_eal/version.map | 1 - lib/librte_eal/x86/rte_power_intrinsics.c | 54 ------------------- 5 files changed, 121 deletions(-) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index 5f1caaf25b..8d271dc0c1 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -17,20 +17,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, return -ENOTSUP; } -/** - * This function is not supported on ARM. 
- */ -int -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - - return -ENOTSUP; -} - /** * This function is not supported on ARM. */ diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 3ad53068d5..85343bc9eb 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -61,44 +61,6 @@ struct rte_power_monitor_cond { __rte_experimental int rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint64_t tsc_timestamp); -/** - * @warning - * @b EXPERIMENTAL: this API may change without prior notice - * - * Monitor specific address for changes. This will cause the CPU to enter an - * architecture-defined optimized power state until either the specified - * memory address is written to, a certain TSC timestamp is reached, or other - * reasons cause the CPU to wake up. - * - * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If - * mask is non-zero, the current value pointed to by the `p` pointer will be - * checked against the expected value, and if they match, the entering of - * optimized power state may be aborted. - * - * This call will also lock a spinlock on entering sleep, and release it on - * waking up the CPU. - * - * @warning It is responsibility of the user to check if this function is - * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * - * @param pmc - * The monitoring condition structure. - * @param tsc_timestamp - * Maximum TSC timestamp to wait for. Note that the wait behavior is - * architecture-dependent. - * @param lck - * A spinlock that must be locked before entering the function, will be - * unlocked while the CPU is sleeping, and will be locked again once the CPU - * wakes up. 
- * - * @return - * 0 on success - * -EINVAL on invalid parameters - * -ENOTSUP if unsupported - */ -__rte_experimental -int rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck); /** * @warning diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 5e5a1fff5a..f7862ea324 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -17,20 +17,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, return -ENOTSUP; } -/** - * This function is not supported on PPC64. - */ -int -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - RTE_SET_USED(pmc); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - - return -ENOTSUP; -} - /** * This function is not supported on PPC64. */ diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 32eceb8869..1fcd1d3bed 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -406,7 +406,6 @@ EXPERIMENTAL { # added in 21.02 rte_power_monitor; - rte_power_monitor_sync; rte_power_pause; rte_thread_tls_key_create; rte_thread_tls_key_delete; diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 6be5c8b9f1..29247d8638 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -90,60 +90,6 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, return 0; } -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
- */ -int -rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, - const uint64_t tsc_timestamp, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* prevent user from running this instruction if it's not supported */ - if (!wait_supported) - return -ENOTSUP; - - if (pmc == NULL || lck == NULL) - return -EINVAL; - - if (__check_val_size(pmc->data_sz) < 0) - return -EINVAL; - - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(pmc->addr)); - - if (pmc->mask) { - const uint64_t cur_value = __get_umwait_val( - pmc->addr, pmc->data_sz); - const uint64_t masked = cur_value & pmc->mask; - - /* if the masked value is already matching, abort */ - if (masked == pmc->val) - return 0; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); - - return 0; -} - /** * This function uses TPAUSE instruction and will enter C0.2 state. For more * information about usage of this instruction, please refer to Intel(R) 64 and -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (3 preceding siblings ...) 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 04/11] eal: remove sync version of power monitor Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-13 12:46 ` Ananyev, Konstantin 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API Anatoly Burakov ` (7 subsequent siblings) 12 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Now that we have everything in a C file, we can store the information about our sleep, and have a native mechanism to wake up the sleeping core. This mechanism would however only wake up a core that's sleeping while monitoring - waking up from `rte_power_pause` won't work. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v16: - Improve error handling - Take a lock before UMONITOR v13: - Add comments around wakeup code to explain what it does - Add lcore_id parameter checking to prevent buffer overrun lib/librte_eal/arm/rte_power_intrinsics.c | 9 ++ .../include/generic/rte_power_intrinsics.h | 16 +++ lib/librte_eal/ppc/rte_power_intrinsics.c | 9 ++ lib/librte_eal/version.map | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 98 ++++++++++++++++++- 5 files changed, 132 insertions(+), 1 deletion(-) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index 8d271dc0c1..5a24c13913 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -27,3 +27,12 @@ rte_power_pause(const uint64_t tsc_timestamp) return -ENOTSUP; } + +/** + * This function is not supported on ARM. 
+ */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 85343bc9eb..6109d28faa 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -62,6 +62,22 @@ __rte_experimental int rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint64_t tsc_timestamp); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Wake up a specific lcore that is in a power optimized state and is monitoring + * an address. + * + * @note This function will *not* wake up a core that is in a power optimized + * state due to calling `rte_power_pause`. + * + * @param lcore_id + * Lcore ID of a sleeping thread. + */ +__rte_experimental +int rte_power_monitor_wakeup(const unsigned int lcore_id); + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index f7862ea324..7e334f7cf0 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -27,3 +27,12 @@ rte_power_pause(const uint64_t tsc_timestamp) return -ENOTSUP; } + +/** + * This function is not supported on PPC64. 
+ */ +void +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + RTE_SET_USED(lcore_id); +} diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index 1fcd1d3bed..fce90a112f 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -406,6 +406,7 @@ EXPERIMENTAL { # added in 21.02 rte_power_monitor; + rte_power_monitor_wakeup; rte_power_pause; rte_thread_tls_key_create; rte_thread_tls_key_delete; diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 29247d8638..a9e1689f75 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -2,8 +2,31 @@ * Copyright(c) 2020 Intel Corporation */ +#include <rte_common.h> +#include <rte_lcore.h> +#include <rte_spinlock.h> + #include "rte_power_intrinsics.h" +/* + * Per-lcore structure holding current status of C0.2 sleeps. + */ +static struct power_wait_status { + rte_spinlock_t lock; + volatile void *monitor_addr; /**< NULL if not currently sleeping */ +} __rte_cache_aligned wait_status[RTE_MAX_LCORE]; + +static inline void +__umwait_wakeup(volatile void *addr) +{ + uint64_t val; + + /* trigger a write but don't change the value */ + val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED); + __atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0, + __ATOMIC_RELAXED, __ATOMIC_RELAXED); +} + static bool wait_supported; static inline uint64_t @@ -51,17 +74,29 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + const unsigned int lcore_id = rte_lcore_id(); + struct power_wait_status *s; /* prevent user from running this instruction if it's not supported */ if (!wait_supported) return -ENOTSUP; + /* prevent non-EAL thread from using this API */ + if (lcore_id >= RTE_MAX_LCORE) + return -EINVAL; + if (pmc == NULL) return -EINVAL; if 
(__check_val_size(pmc->data_sz) < 0) return -EINVAL; + s = &wait_status[lcore_id]; + + /* update sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = pmc->addr; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. @@ -72,21 +107,37 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, : : "D"(pmc->addr)); + /* now that we've put this address into monitor, we can unlock */ + rte_spinlock_unlock(&s->lock); + + /* if we have a comparison mask, we might not need to sleep at all */ if (pmc->mask) { const uint64_t cur_value = __get_umwait_val( pmc->addr, pmc->data_sz); const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == pmc->val) + if (masked == pmc->val) { + /* erase sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = NULL; + rte_spinlock_unlock(&s->lock); + return 0; + } } + /* execute UMWAIT */ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); + /* erase sleep address */ + rte_spinlock_lock(&s->lock); + s->monitor_addr = NULL; + rte_spinlock_unlock(&s->lock); + return 0; } @@ -122,3 +173,48 @@ RTE_INIT(rte_power_intrinsics_init) { if (i.power_monitor && i.power_pause) wait_supported = 1; } + +int +rte_power_monitor_wakeup(const unsigned int lcore_id) +{ + struct power_wait_status *s; + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return -ENOTSUP; + + /* prevent buffer overrun */ + if (lcore_id >= RTE_MAX_LCORE) + return -EINVAL; + + s = &wait_status[lcore_id]; + + /* + * There is a race condition between sleep, wakeup and locking, but we + * don't need to handle it. + * + * Possible situations: + * + * 1. T1 locks, sets address, unlocks + * 2. T2 locks, triggers wakeup, unlocks + * 3. 
T1 sleeps + * + * In this case, because T1 has already set the address for monitoring, + * we will wake up immediately even if T2 triggers wakeup before T1 + * goes to sleep. + * + * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up + * 2. T2 locks, triggers wakeup, and unlocks + * 3. T1 locks, erases address, and unlocks + * + * In this case, since we've already woken up, the "wakeup" was + * unneeded, and since T1 is still waiting on T2 releasing the lock, the + * wakeup address is still valid so it's perfectly safe to write it. + */ + rte_spinlock_lock(&s->lock); + if (s->monitor_addr != NULL) + __umwait_wakeup(s->monitor_addr); + rte_spinlock_unlock(&s->lock); + + return 0; +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
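The wakeup scheme in the patch above (record the monitored address per lcore under a lock, then wake the sleeper by storing the unchanged value back to that address) can be sketched outside DPDK as follows. This is only an illustration under stated assumptions: `MAX_WORKERS`, `struct wait_slot`, `set_monitor_addr()` and `wake_worker()` are hypothetical names, the byte spinlock stands in for `rte_spinlock_t`, and the GCC/Clang `__atomic` builtins are assumed to be available. No real UMONITOR/UMWAIT is issued here, so the sketch shows only the bookkeeping and the value-preserving write.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WORKERS 4 /* hypothetical; the patch sizes its table with RTE_MAX_LCORE */

/* Hypothetical per-worker status, modelled on power_wait_status in the patch. */
struct wait_slot {
	char lock;                       /* minimal test-and-set spinlock */
	volatile uint64_t *monitor_addr; /* NULL when the worker is not sleeping */
};

static struct wait_slot slots[MAX_WORKERS];

static void
slot_lock(struct wait_slot *s)
{
	while (__atomic_test_and_set(&s->lock, __ATOMIC_ACQUIRE))
		; /* spin until the previous holder clears the flag */
}

static void
slot_unlock(struct wait_slot *s)
{
	__atomic_clear(&s->lock, __ATOMIC_RELEASE);
}

/* Store the current value back so the address sees a write but keeps its value. */
static void
touch_without_change(volatile uint64_t *addr)
{
	uint64_t val = __atomic_load_n(addr, __ATOMIC_RELAXED);

	__atomic_compare_exchange_n(addr, &val, val, 0,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/* Sleeping side: publish (or clear, with NULL) the address being monitored. */
static int
set_monitor_addr(unsigned int worker, volatile uint64_t *addr)
{
	if (worker >= MAX_WORKERS)
		return -EINVAL;

	slot_lock(&slots[worker]);
	slots[worker].monitor_addr = addr;
	slot_unlock(&slots[worker]);
	return 0;
}

/* Waking side: write to the published address, if there is one. */
static int
wake_worker(unsigned int worker)
{
	struct wait_slot *s;

	if (worker >= MAX_WORKERS)
		return -EINVAL;

	s = &slots[worker];
	slot_lock(s);
	if (s->monitor_addr != NULL)
		touch_without_change(s->monitor_addr);
	slot_unlock(s);
	return 0;
}
```

Holding the lock while writing guarantees the sleeper cannot clear its address in the middle of the wakeup, which is the property the comment in the patch relies on when it argues that the remaining races are harmless.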
* Re: [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function Anatoly Burakov @ 2021-01-13 12:46 ` Ananyev, Konstantin 0 siblings, 0 replies; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-13 12:46 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce, thomas, McDaniel, Timothy, Hunt, David, Macnamara, Chris > Now that we have everything in a C file, we can store the information > about our sleep, and have a native mechanism to wake up the sleeping > core. This mechanism would however only wake up a core that's sleeping > while monitoring - waking up from `rte_power_pause` won't work. > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > --- > > Notes: > v16: > - Improve error handling > - Take a lock before UMONITOR > > v13: > - Add comments around wakeup code to explain what it does > - Add lcore_id parameter checking to prevent buffer overrun > > lib/librte_eal/arm/rte_power_intrinsics.c | 9 ++ > .../include/generic/rte_power_intrinsics.h | 16 +++ > lib/librte_eal/ppc/rte_power_intrinsics.c | 9 ++ > lib/librte_eal/version.map | 1 + > lib/librte_eal/x86/rte_power_intrinsics.c | 98 ++++++++++++++++++- > 5 files changed, 132 insertions(+), 1 deletion(-) > > -- Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (4 preceding siblings ...) 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 05/11] eal: add monitor wakeup function Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-13 13:18 ` Ananyev, Konstantin 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback Anatoly Burakov ` (6 subsequent siblings) 12 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Liang Ma, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, Neil Horman, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Add a simple API to allow getting the monitor conditions for power-optimized monitoring of the Rx queues from the PMD, as well as release notes information. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru> --- Notes: v13: - Fix typos and issues raised by Andrew doc/guides/rel_notes/release_21_02.rst | 5 +++++ lib/librte_ethdev/rte_ethdev.c | 28 ++++++++++++++++++++++++++ lib/librte_ethdev/rte_ethdev.h | 25 +++++++++++++++++++++++ lib/librte_ethdev/rte_ethdev_driver.h | 22 ++++++++++++++++++++ lib/librte_ethdev/version.map | 3 +++ 5 files changed, 83 insertions(+) diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst index 706cbf8f0c..ec9958a141 100644 --- a/doc/guides/rel_notes/release_21_02.rst +++ b/doc/guides/rel_notes/release_21_02.rst @@ -55,6 +55,11 @@ New Features Also, make sure to start the actual text at the margin. 
======================================================= +* **ethdev: added new API for PMD power management** + + * ``rte_eth_get_monitor_addr()``, to be used in conjunction with + ``rte_power_monitor()`` to enable automatic power management for PMD's. + Removed Items ------------- diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c index 17ddacc78d..e19dbd838b 100644 --- a/lib/librte_ethdev/rte_ethdev.c +++ b/lib/librte_ethdev/rte_ethdev.c @@ -5115,6 +5115,34 @@ rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, dev->dev_ops->tx_burst_mode_get(dev, queue_id, mode)); } +int +rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id, + struct rte_power_monitor_cond *pmc) +{ + struct rte_eth_dev *dev; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV); + + dev = &rte_eth_devices[port_id]; + + RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->get_monitor_addr, -ENOTSUP); + + if (queue_id >= dev->data->nb_rx_queues) { + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id); + return -EINVAL; + } + + if (pmc == NULL) { + RTE_ETHDEV_LOG(ERR, "Invalid power monitor condition=%p\n", + pmc); + return -EINVAL; + } + + return eth_err(port_id, + dev->dev_ops->get_monitor_addr(dev->data->rx_queues[queue_id], + pmc)); +} + int rte_eth_dev_set_mc_addr_list(uint16_t port_id, struct rte_ether_addr *mc_addr_set, diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h index f5f8919186..ca0f91312e 100644 --- a/lib/librte_ethdev/rte_ethdev.h +++ b/lib/librte_ethdev/rte_ethdev.h @@ -157,6 +157,7 @@ extern "C" { #include <rte_common.h> #include <rte_config.h> #include <rte_ether.h> +#include <rte_power_intrinsics.h> #include "rte_ethdev_trace_fp.h" #include "rte_dev_info.h" @@ -4334,6 +4335,30 @@ __rte_experimental int rte_eth_tx_burst_mode_get(uint16_t port_id, uint16_t queue_id, struct rte_eth_burst_mode *mode); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Retrieve the monitor condition 
for a given receive queue. + * + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The Rx queue on the Ethernet device for which information + * will be retrieved. + * @param pmc + * The pointer point to power-optimized monitoring condition structure. + * + * @return + * - 0: Success. + * -ENOTSUP: Operation not supported. + * -EINVAL: Invalid parameters. + * -ENODEV: Invalid port ID. + */ +__rte_experimental +int rte_eth_get_monitor_addr(uint16_t port_id, uint16_t queue_id, + struct rte_power_monitor_cond *pmc); + /** * Retrieve device registers and register attributes (number of registers and * register size) diff --git a/lib/librte_ethdev/rte_ethdev_driver.h b/lib/librte_ethdev/rte_ethdev_driver.h index 0eacfd8425..3b3b0ec1a0 100644 --- a/lib/librte_ethdev/rte_ethdev_driver.h +++ b/lib/librte_ethdev/rte_ethdev_driver.h @@ -763,6 +763,26 @@ typedef int (*eth_hairpin_queue_peer_unbind_t) (struct rte_eth_dev *dev, uint16_t cur_queue, uint32_t direction); /**< @internal Unbind peer queue from the current queue. */ +/** + * @internal + * Get address of memory location whose contents will change whenever there is + * new data to be received on an Rx queue. + * + * @param rxq + * Ethdev queue pointer. + * @param pmc + * The pointer to power-optimized monitoring condition structure. + * @return + * Negative errno value on error, 0 on success. + * + * @retval 0 + * Success + * @retval -EINVAL + * Invalid parameters + */ +typedef int (*eth_get_monitor_addr_t)(void *rxq, + struct rte_power_monitor_cond *pmc); + /** * @internal A structure containing the functions exported by an Ethernet driver. */ @@ -917,6 +937,8 @@ struct eth_dev_ops { /**< Set up the connection between the pair of hairpin queues. */ eth_hairpin_queue_peer_unbind_t hairpin_queue_peer_unbind; /**< Disconnect the hairpin queues of a pair from each other. */ + eth_get_monitor_addr_t get_monitor_addr; + /**< Get power monitoring condition for Rx queue. 
*/ }; /** diff --git a/lib/librte_ethdev/version.map b/lib/librte_ethdev/version.map index d3f5410806..a124e1e370 100644 --- a/lib/librte_ethdev/version.map +++ b/lib/librte_ethdev/version.map @@ -240,6 +240,9 @@ EXPERIMENTAL { rte_flow_get_restore_info; rte_flow_tunnel_action_decap_release; rte_flow_tunnel_item_release; + + # added in 21.02 + rte_eth_get_monitor_addr; }; INTERNAL { -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API
  2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-13 13:18 ` Ananyev, Konstantin
  0 siblings, 0 replies; 421+ messages in thread
From: Ananyev, Konstantin @ 2021-01-13 13:18 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: Ma, Liang J, Thomas Monjalon, Yigit, Ferruh, Andrew Rybchenko,
	Ray Kinsella, Neil Horman, McDaniel, Timothy, Hunt, David,
	Richardson, Bruce, Macnamara, Chris

>
> Add a simple API to allow getting the monitor conditions for
> power-optimized monitoring of the Rx queues from the PMD, as well as
> release notes information.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---
>
> Notes:
>     v13:
>     - Fix typos and issues raised by Andrew
>
>  doc/guides/rel_notes/release_21_02.rst |  5 +++++
>  lib/librte_ethdev/rte_ethdev.c         | 28 ++++++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev.h         | 25 +++++++++++++++++++++++
>  lib/librte_ethdev/rte_ethdev_driver.h  | 22 ++++++++++++++++++++
>  lib/librte_ethdev/version.map          |  3 +++
>  5 files changed, 83 insertions(+)
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.25.1

^ permalink raw reply	[flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback
  2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov
  ` (5 preceding siblings ...)
  2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 06/11] ethdev: add simple power management API Anatoly Burakov
@ 2021-01-12 17:37 ` Anatoly Burakov
  2021-01-13 12:58 ` Ananyev, Konstantin
  2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 08/11] net/ixgbe: implement power management API Anatoly Burakov
  ` (5 subsequent siblings)
  12 siblings, 1 reply; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw)
  To: dev
  Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman, thomas,
	konstantin.ananyev, timothy.mcdaniel, bruce.richardson,
	chris.macnamara

From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that lasts until either a
TSC timestamp expires or packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design uses PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting for the address of the next
   RX descriptor to be written to.

2. TPAUSE/Pause instruction

   This method uses the pause (or TPAUSE, if available) instruction to
   avoid busy polling.

3. Frequency scaling

   Reuse the existing DPDK power library to scale the core frequency
   up/down depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v15:
    - Fix check in UMWAIT callback

    v13:
    - Rework the synchronization mechanism to not require locking
    - Add more parameter checking
    - Rework n_rx_queues access to not go through internal PMD structures and use
      public API instead

 doc/guides/prog_guide/power_man.rst    |  44 +++
 doc/guides/rel_notes/release_21_02.rst |  10 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 359 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
 lib/librte_power/version.map           |   5 +
 6 files changed, 511 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..02280dd689 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,47 @@ User Cases
 ----------
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever the empty poll count reaches a certain number.
+
+  * Monitor
+
+   This power saving scheme will put the CPU into an optimized power state and use
+   the ``rte_power_monitor()`` function to monitor the Ethernet PMD RX
+   descriptor address, and wake the CPU up whenever there's new traffic.
+ + * Pause + + This power saving scheme will avoid busy polling by either entering + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's + not available, use ``rte_pause()``. + + * Frequency scaling + + This power saving scheme will use existing ``librte_power`` library + functionality to scale the core frequency up/down depending on traffic + volume. + + +.. note:: + + Currently, this power management API is limited to mandatory mapping of 1 + queue to 1 core (multiple queues are supported, but they must be polled from + different cores). + +API Overview for PMD Power Management +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +* **Queue Enable**: Enable specific power scheme for certain queue/port/core + +* **Queue Disable**: Disable power scheme for certain queue/port/core + References ---------- @@ -200,3 +241,6 @@ References * The :doc:`../sample_app_ug/vm_power_management` chapter in the :doc:`../sample_app_ug/index` section. + +* The :doc:`../sample_app_ug/rxtx_callbacks` + chapter in the :doc:`../sample_app_ug/index` section. diff --git a/doc/guides/rel_notes/release_21_02.rst b/doc/guides/rel_notes/release_21_02.rst index ec9958a141..9cd8214e2d 100644 --- a/doc/guides/rel_notes/release_21_02.rst +++ b/doc/guides/rel_notes/release_21_02.rst @@ -60,6 +60,16 @@ New Features * ``rte_eth_get_monitor_addr()``, to be used in conjunction with ``rte_power_monitor()`` to enable automatic power management for PMD's. +* **Add PMD power management helper API** + + A new helper API has been added to make using Ethernet PMD power management + easier for the user: ``rte_power_pmd_mgmt_queue_enable()``. 
Three power + management schemes are supported initially: + + * Power saving based on UMWAIT instruction (x86 only) + * Power saving based on ``rte_pause()`` (generic) or TPAUSE instruction (x86 only) + * Power saving based on frequency scaling through the ``librte_power`` library + Removed Items ------------- diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build index 4b4cf1b90b..51a471b669 100644 --- a/lib/librte_power/meson.build +++ b/lib/librte_power/meson.build @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c', 'power_kvm_vm.c', 'guest_channel.c', 'rte_power_empty_poll.c', 'power_pstate_cpufreq.c', + 'rte_power_pmd_mgmt.c', 'power_common.c') -headers = files('rte_power.h','rte_power_empty_poll.h') -deps += ['timer'] +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h') +deps += ['timer' ,'ethdev'] diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c new file mode 100644 index 0000000000..470c3a912b --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.c @@ -0,0 +1,359 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#include <rte_lcore.h> +#include <rte_cycles.h> +#include <rte_cpuflags.h> +#include <rte_malloc.h> +#include <rte_ethdev.h> +#include <rte_power_intrinsics.h> + +#include "rte_power_pmd_mgmt.h" + +#define EMPTYPOLL_MAX 512 + +static struct pmd_conf_data { + struct rte_cpu_intrinsics intrinsics_support; + /**< what do we support? */ + uint64_t tsc_per_us; + /**< pre-calculated tsc diff for 1us */ + uint64_t pause_per_us; + /**< how many rte_pause can we fit in a microisecond? */ +} global_data; + +/** + * Possible power management states of an ethdev port. + */ +enum pmd_mgmt_state { + /** Device power management is disabled. */ + PMD_MGMT_DISABLED = 0, + /** Device power management is enabled. */ + PMD_MGMT_ENABLED, + /** Device powermanagement status is about to change. 
*/ + PMD_MGMT_BUSY +}; + +struct pmd_queue_cfg { + volatile enum pmd_mgmt_state pwr_mgmt_state; + /**< State of power management for this queue */ + enum rte_power_pmd_mgmt_type cb_mode; + /**< Callback mode for this queue */ + const struct rte_eth_rxtx_callback *cur_cb; + /**< Callback instance */ + volatile bool umwait_in_progress; + /**< are we currently sleeping? */ + uint64_t empty_poll_stats; + /**< Number of empty polls */ +} __rte_cache_aligned; + +static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT]; + +static void +calc_tsc(void) +{ + const uint64_t hz = rte_get_timer_hz(); + const uint64_t tsc_per_us = hz / US_PER_S; /* 1us */ + + global_data.tsc_per_us = tsc_per_us; + + /* only do this if we don't have tpause */ + if (!global_data.intrinsics_support.power_pause) { + const uint64_t start = rte_rdtsc_precise(); + const uint32_t n_pauses = 10000; + double us, us_per_pause; + uint64_t end; + unsigned int i; + + /* estimate number of rte_pause() calls per us*/ + for (i = 0; i < n_pauses; i++) + rte_pause(); + + end = rte_rdtsc_precise(); + us = (end - start) / (double)tsc_per_us; + us_per_pause = us / n_pauses; + + global_data.pause_per_us = (uint64_t)(1.0 / us_per_pause); + } +} + +static uint16_t +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, + uint16_t nb_rx, uint16_t max_pkts __rte_unused, + void *addr __rte_unused) +{ + + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + struct rte_power_monitor_cond pmc; + uint16_t ret; + + /* + * we might get a cancellation request while being + * inside the callback, in which case the wakeup + * wouldn't work because it would've arrived too early. + * + * to get around this, we notify the other thread that + * we're sleeping, so that it can spin until we're done. + * unsolicited wakeups are perfectly safe. 
+ */ + q_conf->umwait_in_progress = true; + + /* check if we need to cancel sleep */ + if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { + /* use monitoring condition to sleep */ + ret = rte_eth_get_monitor_addr(port_id, qidx, + &pmc); + if (ret == 0) + rte_power_monitor(&pmc, -1ULL); + } + q_conf->umwait_in_progress = false; + } + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, + uint16_t nb_rx, uint16_t max_pkts __rte_unused, + void *addr __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + /* sleep for 1 microsecond */ + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + /* use tpause if we have it */ + if (global_data.intrinsics_support.power_pause) { + const uint64_t cur = rte_rdtsc(); + const uint64_t wait_tsc = + cur + global_data.tsc_per_us; + rte_power_pause(wait_tsc); + } else { + uint64_t i; + for (i = 0; i < global_data.pause_per_us; i++) + rte_pause(); + } + } + } else + q_conf->empty_poll_stats = 0; + + return nb_rx; +} + +static uint16_t +clb_scale_freq(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *_ __rte_unused) +{ + struct pmd_queue_cfg *q_conf; + + q_conf = &port_cfg[port_id][qidx]; + + if (unlikely(nb_rx == 0)) { + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) + /* scale down freq */ + rte_power_freq_min(rte_lcore_id()); + } else { + q_conf->empty_poll_stats = 0; + /* scale up freq */ + rte_power_freq_max(rte_lcore_id()); + } + + return nb_rx; +} + +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, + uint16_t queue_id, enum rte_power_pmd_mgmt_type mode) +{ + struct pmd_queue_cfg *queue_cfg; + struct rte_eth_dev_info info; + int ret; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 
-EINVAL); + + if (queue_id >= RTE_MAX_QUEUES_PER_PORT || lcore_id >= RTE_MAX_LCORE) { + ret = -EINVAL; + goto end; + } + + if (rte_eth_dev_info_get(port_id, &info) < 0) { + ret = -EINVAL; + goto end; + } + + /* check if queue id is valid */ + if (queue_id >= info.nb_rx_queues) { + ret = -EINVAL; + goto end; + } + + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) { + ret = -EINVAL; + goto end; + } + + /* we're about to change our state */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; + + /* we need this in various places */ + rte_cpu_get_intrinsics_support(&global_data.intrinsics_support); + + switch (mode) { + case RTE_POWER_MGMT_TYPE_MONITOR: + { + struct rte_power_monitor_cond dummy; + + /* check if rte_power_monitor is supported */ + if (!global_data.intrinsics_support.power_monitor) { + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); + ret = -ENOTSUP; + goto rollback; + } + + /* check if the device supports the necessary PMD API */ + if (rte_eth_get_monitor_addr(port_id, queue_id, + &dummy) == -ENOTSUP) { + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n"); + ret = -ENOTSUP; + goto rollback; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->umwait_in_progress = false; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_umwait, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_SCALE: + { + enum power_management_env env; + /* only PSTATE and ACPI modes are supported */ + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && + !rte_power_check_env_supported( + PM_ENV_PSTATE_CPUFREQ)) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); + ret = -ENOTSUP; + goto rollback; + } + /* ensure we could initialize the power library */ + if (rte_power_init(lcore_id)) { + ret = -EINVAL; + goto rollback; 
+ } + /* ensure we initialized the correct env */ + env = rte_power_get_env(); + if (env != PM_ENV_ACPI_CPUFREQ && + env != PM_ENV_PSTATE_CPUFREQ) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); + ret = -ENOTSUP; + goto rollback; + } + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, + queue_id, clb_scale_freq, NULL); + break; + } + case RTE_POWER_MGMT_TYPE_PAUSE: + /* figure out various time-to-tsc conversions */ + if (global_data.tsc_per_us == 0) + calc_tsc(); + + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb_pause, NULL); + break; + } + ret = 0; + + return ret; + +rollback: + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; +end: + return ret; +} + +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id) +{ + struct pmd_queue_cfg *queue_cfg; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); + + if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) + return -EINVAL; + + /* no need to check queue id as wrong queue id would not be enabled */ + queue_cfg = &port_cfg[port_id][queue_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) + return -EINVAL; + + /* let the callback know we're shutting down */ + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; + + switch (queue_cfg->cb_mode) { + case RTE_POWER_MGMT_TYPE_MONITOR: + { + bool exit = false; + do { + /* + * we may request cancellation while the other thread + * has just entered the callback but hasn't started + * sleeping yet, so keep waking it up until we know it's + * done sleeping. 
+ */ + if (queue_cfg->umwait_in_progress) + rte_power_monitor_wakeup(lcore_id); + else + exit = true; + } while (!exit); + } + /* fall-through */ + case RTE_POWER_MGMT_TYPE_PAUSE: + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + break; + case RTE_POWER_MGMT_TYPE_SCALE: + rte_power_freq_max(lcore_id); + rte_eth_remove_rx_callback(port_id, queue_id, + queue_cfg->cur_cb); + rte_power_exit(lcore_id); + break; + } + /* + * we don't free the RX callback here because it is unsafe to do so + * unless we know for a fact that all data plane threads have stopped. + */ + queue_cfg->cur_cb = NULL; + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + + return 0; +} diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h new file mode 100644 index 0000000000..0bfbc6ba69 --- /dev/null +++ b/lib/librte_power/rte_power_pmd_mgmt.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2020 Intel Corporation + */ + +#ifndef _RTE_POWER_PMD_MGMT_H +#define _RTE_POWER_PMD_MGMT_H + +/** + * @file + * RTE PMD Power Management + */ +#include <stdint.h> +#include <stdbool.h> + +#include <rte_common.h> +#include <rte_byteorder.h> +#include <rte_log.h> +#include <rte_power.h> +#include <rte_atomic.h> + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * PMD Power Management Type + */ +enum rte_power_pmd_mgmt_type { + /** Use power-optimized monitoring to wait for incoming traffic */ + RTE_POWER_MGMT_TYPE_MONITOR = 1, + /** Use power-optimized sleep to avoid busy polling */ + RTE_POWER_MGMT_TYPE_PAUSE, + /** Use frequency scaling when traffic is low */ + RTE_POWER_MGMT_TYPE_SCALE, +}; + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Enable power management on a specified RX queue and lcore. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. 
+ * @param queue_id + * The queue identifier of the Ethernet device. + * @param mode + * The power management callback function type. + + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id, + enum rte_power_pmd_mgmt_type mode); + +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Disable power management on a specified RX queue and lcore. + * + * @note This function is not thread-safe. + * + * @param lcore_id + * lcore_id. + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The queue identifier of the Ethernet device. + * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id); +#ifdef __cplusplus +} +#endif + +#endif diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map index 69ca9af616..61996b4d11 100644 --- a/lib/librte_power/version.map +++ b/lib/librte_power/version.map @@ -34,4 +34,9 @@ EXPERIMENTAL { rte_power_guest_channel_receive_msg; rte_power_poll_stat_fetch; rte_power_poll_stat_update; + + # added in 21.02 + rte_power_pmd_mgmt_queue_enable; + rte_power_pmd_mgmt_queue_disable; + }; -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
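For anyone reading along: all three callbacks in this patch (clb_umwait, clb_pause, clb_scale_freq) share the same empty-poll bookkeeping and differ only in what they do once the threshold is crossed. A minimal standalone sketch of just that bookkeeping is below. It is not the patch's code: `struct poll_state` and `poll_should_save_power()` are hypothetical names made up for illustration, and only the threshold value is taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same threshold value as EMPTYPOLL_MAX in the patch. */
#define EMPTYPOLL_MAX 512

/* Hypothetical per-queue state; the patch keeps this in pmd_queue_cfg. */
struct poll_state {
	uint64_t empty_polls;
};

/*
 * Update the empty-poll counter for one Rx burst result.
 * Returns true when the caller should take its power-saving action
 * (monitor, pause or scale down); any non-empty burst resets the counter.
 */
static bool
poll_should_save_power(struct poll_state *st, uint16_t nb_rx)
{
	if (nb_rx == 0) {
		st->empty_polls++;
		return st->empty_polls > EMPTYPOLL_MAX;
	}
	st->empty_polls = 0;
	return false;
}
```

Because the comparison is a strict "greater than", the first power-saving action happens on empty poll number EMPTYPOLL_MAX + 1, and a single received packet resets the count.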
* Re: [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback Anatoly Burakov @ 2021-01-13 12:58 ` Ananyev, Konstantin 2021-01-13 17:29 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Ananyev, Konstantin @ 2021-01-13 12:58 UTC (permalink / raw) To: Burakov, Anatoly, dev Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, thomas, McDaniel, Timothy, Richardson, Bruce, Macnamara, Chris > -----Original Message----- > From: Burakov, Anatoly <anatoly.burakov@intel.com> > Sent: Tuesday, January 12, 2021 5:37 PM > To: dev@dpdk.org > Cc: Ma, Liang J <liang.j.ma@intel.com>; Hunt, David <david.hunt@intel.com>; Ray Kinsella <mdr@ashroe.eu>; Neil Horman > <nhorman@tuxdriver.com>; thomas@monjalon.net; Ananyev, Konstantin <konstantin.ananyev@intel.com>; McDaniel, Timothy > <timothy.mcdaniel@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Macnamara, Chris <chris.macnamara@intel.com> > Subject: [PATCH v16 07/11] power: add PMD power management API and callback > > From: Liang Ma <liang.j.ma@intel.com> > > Add a simple on/off switch that will enable saving power when no > packets are arriving. It is based on counting the number of empty > polls and, when the number reaches a certain threshold, entering an > architecture-defined optimized power state that will either wait > until a TSC timestamp expires, or when packets arrive. > > This API mandates a core-to-single-queue mapping (that is, multiple > queued per device are supported, but they have to be polled on different > cores). > > This design is using PMD RX callbacks. > > 1. UMWAIT/UMONITOR: > > When a certain threshold of empty polls is reached, the core will go > into a power optimized sleep while waiting on an address of next RX > descriptor to be written to. > > 2. 
TPAUSE/Pause instruction > > This method uses the pause (or TPAUSE, if available) instruction to > avoid busy polling. > > 3. Frequency scaling > Reuse existing DPDK power library to scale up/down core frequency > depending on traffic volume. > > Signed-off-by: Liang Ma <liang.j.ma@intel.com> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> > --- > > Notes: > v15: > - Fix check in UMWAIT callback > > v13: > - Rework the synchronization mechanism to not require locking > - Add more parameter checking > - Rework n_rx_queues access to not go through internal PMD structures and use > public API instead > > v13: > - Rework the synchronization mechanism to not require locking > - Add more parameter checking > - Rework n_rx_queues access to not go through internal PMD structures and use > public API instead > > doc/guides/prog_guide/power_man.rst | 44 +++ > doc/guides/rel_notes/release_21_02.rst | 10 + > lib/librte_power/meson.build | 5 +- > lib/librte_power/rte_power_pmd_mgmt.c | 359 +++++++++++++++++++++++++ > lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++++ > lib/librte_power/version.map | 5 + > 6 files changed, 511 insertions(+), 2 deletions(-) > create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c > create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h > ... > + > +static uint16_t > +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, > + uint16_t nb_rx, uint16_t max_pkts __rte_unused, > + void *addr __rte_unused) > +{ > + > + struct pmd_queue_cfg *q_conf; > + > + q_conf = &port_cfg[port_id][qidx]; > + > + if (unlikely(nb_rx == 0)) { > + q_conf->empty_poll_stats++; > + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { > + struct rte_power_monitor_cond pmc; > + uint16_t ret; > + > + /* > + * we might get a cancellation request while being > + * inside the callback, in which case the wakeup > + * wouldn't work because it would've arrived too early. 
> + * > + * to get around this, we notify the other thread that > + * we're sleeping, so that it can spin until we're done. > + * unsolicited wakeups are perfectly safe. > + */ > + q_conf->umwait_in_progress = true; This write and subsequent read can be reordered by the cpu. I think you need rte_atomic_thread_fence(__ATOMIC_SEQ_CST) here and in disable() code-path below. > + > + /* check if we need to cancel sleep */ > + if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { > + /* use monitoring condition to sleep */ > + ret = rte_eth_get_monitor_addr(port_id, qidx, > + &pmc); > + if (ret == 0) > + rte_power_monitor(&pmc, -1ULL); > + } > + q_conf->umwait_in_progress = false; > + } > + } else > + q_conf->empty_poll_stats = 0; > + > + return nb_rx; > +} > + ... > + > +int > +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, > + uint16_t port_id, uint16_t queue_id) > +{ > + struct pmd_queue_cfg *queue_cfg; > + > + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); > + > + if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) > + return -EINVAL; > + > + /* no need to check queue id as wrong queue id would not be enabled */ > + queue_cfg = &port_cfg[port_id][queue_id]; > + > + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) > + return -EINVAL; > + > + /* let the callback know we're shutting down */ > + queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; Same as above - write to pwr_mgmt_state and read from umwait_in_progress could be reordered by cpu. Need to insert rte_atomic_thread_fence(__ATOMIC_SEQ_CST) between them. BTW, out of curiosity - why do you need this intermediate state (PMD_MGMT_BUSY) at all? Why not directly: queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; ? 
> + > + switch (queue_cfg->cb_mode) { > + case RTE_POWER_MGMT_TYPE_MONITOR: > + { > + bool exit = false; > + do { > + /* > + * we may request cancellation while the other thread > + * has just entered the callback but hasn't started > + * sleeping yet, so keep waking it up until we know it's > + * done sleeping. > + */ > + if (queue_cfg->umwait_in_progress) > + rte_power_monitor_wakeup(lcore_id); > + else > + exit = true; > + } while (!exit); > + } > + /* fall-through */ > + case RTE_POWER_MGMT_TYPE_PAUSE: > + rte_eth_remove_rx_callback(port_id, queue_id, > + queue_cfg->cur_cb); > + break; > + case RTE_POWER_MGMT_TYPE_SCALE: > + rte_power_freq_max(lcore_id); > + rte_eth_remove_rx_callback(port_id, queue_id, > + queue_cfg->cur_cb); > + rte_power_exit(lcore_id); > + break; > + } > + /* > + * we don't free the RX callback here because it is unsafe to do so > + * unless we know for a fact that all data plane threads have stopped. > + */ > + queue_cfg->cur_cb = NULL; > + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; > + > + return 0; > +} > diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h > new file mode 100644 > index 0000000000..0bfbc6ba69 > --- /dev/null > +++ b/lib/librte_power/rte_power_pmd_mgmt.h > @@ -0,0 +1,90 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2010-2020 Intel Corporation > + */ > + > +#ifndef _RTE_POWER_PMD_MGMT_H > +#define _RTE_POWER_PMD_MGMT_H > + > +/** > + * @file > + * RTE PMD Power Management > + */ > +#include <stdint.h> > +#include <stdbool.h> > + > +#include <rte_common.h> > +#include <rte_byteorder.h> > +#include <rte_log.h> > +#include <rte_power.h> > +#include <rte_atomic.h> > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +/** > + * PMD Power Management Type > + */ > +enum rte_power_pmd_mgmt_type { > + /** Use power-optimized monitoring to wait for incoming traffic */ > + RTE_POWER_MGMT_TYPE_MONITOR = 1, > + /** Use power-optimized sleep to avoid busy polling */ > + 
RTE_POWER_MGMT_TYPE_PAUSE, > + /** Use frequency scaling when traffic is low */ > + RTE_POWER_MGMT_TYPE_SCALE, > +}; > + > +/** > + * @warning > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice > + * > + * Enable power management on a specified RX queue and lcore. > + * > + * @note This function is not thread-safe. > + * > + * @param lcore_id > + * lcore_id. > + * @param port_id > + * The port identifier of the Ethernet device. > + * @param queue_id > + * The queue identifier of the Ethernet device. > + * @param mode > + * The power management callback function type. > + > + * @return > + * 0 on success > + * <0 on error > + */ > +__rte_experimental > +int > +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id, > + uint16_t port_id, uint16_t queue_id, > + enum rte_power_pmd_mgmt_type mode); > + > +/** > + * @warning > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice > + * > + * Disable power management on a specified RX queue and lcore. > + * > + * @note This function is not thread-safe. > + * > + * @param lcore_id > + * lcore_id. > + * @param port_id > + * The port identifier of the Ethernet device. > + * @param queue_id > + * The queue identifier of the Ethernet device. > + * @return > + * 0 on success > + * <0 on error > + */ > +__rte_experimental > +int > +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, > + uint16_t port_id, uint16_t queue_id); > +#ifdef __cplusplus > +} > +#endif > + > +#endif > diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map > index 69ca9af616..61996b4d11 100644 > --- a/lib/librte_power/version.map > +++ b/lib/librte_power/version.map > @@ -34,4 +34,9 @@ EXPERIMENTAL { > rte_power_guest_channel_receive_msg; > rte_power_poll_stat_fetch; > rte_power_poll_stat_update; > + > + # added in 21.02 > + rte_power_pmd_mgmt_queue_enable; > + rte_power_pmd_mgmt_queue_disable; > + > }; > -- > 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback 2021-01-13 12:58 ` Ananyev, Konstantin @ 2021-01-13 17:29 ` Burakov, Anatoly 2021-01-14 13:00 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-13 17:29 UTC (permalink / raw) To: Ananyev, Konstantin, dev Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, thomas, McDaniel, Timothy, Richardson, Bruce, Macnamara, Chris On 13-Jan-21 12:58 PM, Ananyev, Konstantin wrote: > > >> -----Original Message----- >> From: Burakov, Anatoly <anatoly.burakov@intel.com> >> Sent: Tuesday, January 12, 2021 5:37 PM >> To: dev@dpdk.org >> Cc: Ma, Liang J <liang.j.ma@intel.com>; Hunt, David <david.hunt@intel.com>; Ray Kinsella <mdr@ashroe.eu>; Neil Horman >> <nhorman@tuxdriver.com>; thomas@monjalon.net; Ananyev, Konstantin <konstantin.ananyev@intel.com>; McDaniel, Timothy >> <timothy.mcdaniel@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Macnamara, Chris <chris.macnamara@intel.com> >> Subject: [PATCH v16 07/11] power: add PMD power management API and callback >> >> From: Liang Ma <liang.j.ma@intel.com> >> >> Add a simple on/off switch that will enable saving power when no >> packets are arriving. It is based on counting the number of empty >> polls and, when the number reaches a certain threshold, entering an >> architecture-defined optimized power state that will either wait >> until a TSC timestamp expires, or when packets arrive. >> >> This API mandates a core-to-single-queue mapping (that is, multiple >> queued per device are supported, but they have to be polled on different >> cores). >> >> This design is using PMD RX callbacks. >> >> 1. UMWAIT/UMONITOR: >> >> When a certain threshold of empty polls is reached, the core will go >> into a power optimized sleep while waiting on an address of next RX >> descriptor to be written to. >> >> 2. 
TPAUSE/Pause instruction >> >> This method uses the pause (or TPAUSE, if available) instruction to >> avoid busy polling. >> >> 3. Frequency scaling >> Reuse existing DPDK power library to scale up/down core frequency >> depending on traffic volume. >> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> >> --- >> >> Notes: >> v15: >> - Fix check in UMWAIT callback >> >> v13: >> - Rework the synchronization mechanism to not require locking >> - Add more parameter checking >> - Rework n_rx_queues access to not go through internal PMD structures and use >> public API instead >> >> v13: >> - Rework the synchronization mechanism to not require locking >> - Add more parameter checking >> - Rework n_rx_queues access to not go through internal PMD structures and use >> public API instead >> >> doc/guides/prog_guide/power_man.rst | 44 +++ >> doc/guides/rel_notes/release_21_02.rst | 10 + >> lib/librte_power/meson.build | 5 +- >> lib/librte_power/rte_power_pmd_mgmt.c | 359 +++++++++++++++++++++++++ >> lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++++ >> lib/librte_power/version.map | 5 + >> 6 files changed, 511 insertions(+), 2 deletions(-) >> create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c >> create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h >> > > ... > >> + >> +static uint16_t >> +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, >> +uint16_t nb_rx, uint16_t max_pkts __rte_unused, >> +void *addr __rte_unused) >> +{ >> + >> +struct pmd_queue_cfg *q_conf; >> + >> +q_conf = &port_cfg[port_id][qidx]; >> + >> +if (unlikely(nb_rx == 0)) { >> +q_conf->empty_poll_stats++; >> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { >> +struct rte_power_monitor_cond pmc; >> +uint16_t ret; >> + >> +/* >> + * we might get a cancellation request while being >> + * inside the callback, in which case the wakeup >> + * wouldn't work because it would've arrived too early. 
>> + * >> + * to get around this, we notify the other thread that >> + * we're sleeping, so that it can spin until we're done. >> + * unsolicited wakeups are perfectly safe. >> + */ >> +q_conf->umwait_in_progress = true; > > This write and subsequent read can be reordered by the cpu. > I think you need rte_atomic_thread_fence(__ATOMIC_SEQ_CST) here and > in disable() code-path below. > >> + >> +/* check if we need to cancel sleep */ >> +if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { >> +/* use monitoring condition to sleep */ >> +ret = rte_eth_get_monitor_addr(port_id, qidx, >> +&pmc); >> +if (ret == 0) >> +rte_power_monitor(&pmc, -1ULL); >> +} >> +q_conf->umwait_in_progress = false; >> +} >> +} else >> +q_conf->empty_poll_stats = 0; >> + >> +return nb_rx; >> +} >> + > > ... > >> + >> +int >> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, >> +uint16_t port_id, uint16_t queue_id) >> +{ >> +struct pmd_queue_cfg *queue_cfg; >> + >> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); >> + >> +if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) >> +return -EINVAL; >> + >> +/* no need to check queue id as wrong queue id would not be enabled */ >> +queue_cfg = &port_cfg[port_id][queue_id]; >> + >> +if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) >> +return -EINVAL; >> + >> +/* let the callback know we're shutting down */ >> +queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; > > Same as above - write to pwr_mgmt_state and read from umwait_in_progress > could be reordered by cpu. > Need to insert rte_atomic_thread_fence(__ATOMIC_SEQ_CST) between them. > > BTW, out of curiosity - why do you need this intermediate > state (PMD_MGMT_BUSY) at all? > Why not directly: > queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; > ? > Thanks for suggestions, i'll add those. The goal for the "intermediate" step is to prevent Rx callback from sleeping in the first place. 
We can't "wake up" earlier than it goes to sleep, but we may get a request to disable power management while we're at the beginning of the callback and haven't yet entered the rte_power_monitor code.

In this case, setting it to "BUSY" will prevent the callback from ever sleeping in the first place (see rte_power_pmd_mgmt:108 check), and will unset the "umwait in progress" if there was any.

So, we have three situations to handle:

1) wake up during umwait
2) "wake up" during callback after we've set the "umwait in progress" flag but before actual umwait happens - we don't want to exit before we're sure there's nothing sleeping there
3) "wake up" during callback before we set the "umwait in progress" flag

1) is handled by the rte_power_monitor_wakeup() call, so that's taken care of. 2) is handled by the other thread waiting on "umwait in progress" becoming false. 3) is handled by having this BUSY check in the umwait thread.

Hope that made sense!

--
Thanks,
Anatoly

^ permalink raw reply [flat|nested] 421+ messages in thread
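The three cases above can be sketched with C11 atomics. This is a minimal, hypothetical model and not the patch's code: `struct queue_cfg`, `callback_try_sleep()`, `disable_begin()` and `disable_finish()` are names made up for illustration, and `atomic_thread_fence(memory_order_seq_cst)` stands in for the `rte_atomic_thread_fence(__ATOMIC_SEQ_CST)` call Konstantin suggested. The actual sleep and wakeup are left as comments.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the states named in the thread. */
enum { PMD_MGMT_DISABLED, PMD_MGMT_BUSY, PMD_MGMT_ENABLED };

struct queue_cfg {
	atomic_int pwr_mgmt_state;
	atomic_bool umwait_in_progress;
};

/*
 * Rx callback side. Returns true if the core would have entered the
 * monitored sleep, false if the sleep was cancelled (case 3).
 */
static bool
callback_try_sleep(struct queue_cfg *q)
{
	bool slept = false;

	atomic_store_explicit(&q->umwait_in_progress, true,
			memory_order_relaxed);
	/* order the store above before the load below */
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load_explicit(&q->pwr_mgmt_state,
			memory_order_relaxed) == PMD_MGMT_ENABLED) {
		/* a real callback would call rte_power_monitor() here */
		slept = true;
	}
	atomic_store_explicit(&q->umwait_in_progress, false,
			memory_order_release);
	return slept;
}

/*
 * Control side, first half of disable. Returns true if the callback may
 * be sleeping, in which case the caller has to wake it up and wait
 * (cases 1 and 2).
 */
static bool
disable_begin(struct queue_cfg *q)
{
	atomic_store_explicit(&q->pwr_mgmt_state, PMD_MGMT_BUSY,
			memory_order_relaxed);
	/* order the store above before the load below */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&q->umwait_in_progress,
			memory_order_relaxed);
}

/* Control side, second half: spin until the callback is out, then disable. */
static void
disable_finish(struct queue_cfg *q)
{
	while (atomic_load_explicit(&q->umwait_in_progress,
			memory_order_acquire))
		; /* a real implementation would pause here */
	atomic_store_explicit(&q->pwr_mgmt_state, PMD_MGMT_DISABLED,
			memory_order_relaxed);
}
```

Each side stores its own flag and then loads the other side's flag. With a full fence between the store and the load on both sides, at least one of them is guaranteed to observe the other's store: either the callback sees "BUSY" and skips the sleep, or the disable path sees "umwait in progress" and knows it has to wake the core and wait.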
* Re: [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback 2021-01-13 17:29 ` Burakov, Anatoly @ 2021-01-14 13:00 ` Burakov, Anatoly 0 siblings, 0 replies; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-14 13:00 UTC (permalink / raw) To: Ananyev, Konstantin, dev Cc: Ma, Liang J, Hunt, David, Ray Kinsella, Neil Horman, thomas, McDaniel, Timothy, Richardson, Bruce, Macnamara, Chris On 13-Jan-21 5:29 PM, Burakov, Anatoly wrote: > On 13-Jan-21 12:58 PM, Ananyev, Konstantin wrote: >> >> >>> -----Original Message----- >>> From: Burakov, Anatoly <anatoly.burakov@intel.com> >>> Sent: Tuesday, January 12, 2021 5:37 PM >>> To: dev@dpdk.org >>> Cc: Ma, Liang J <liang.j.ma@intel.com>; Hunt, David >>> <david.hunt@intel.com>; Ray Kinsella <mdr@ashroe.eu>; Neil Horman >>> <nhorman@tuxdriver.com>; thomas@monjalon.net; Ananyev, Konstantin >>> <konstantin.ananyev@intel.com>; McDaniel, Timothy >>> <timothy.mcdaniel@intel.com>; Richardson, Bruce >>> <bruce.richardson@intel.com>; Macnamara, Chris >>> <chris.macnamara@intel.com> >>> Subject: [PATCH v16 07/11] power: add PMD power management API and >>> callback >>> >>> From: Liang Ma <liang.j.ma@intel.com> >>> >>> Add a simple on/off switch that will enable saving power when no >>> packets are arriving. It is based on counting the number of empty >>> polls and, when the number reaches a certain threshold, entering an >>> architecture-defined optimized power state that will either wait >>> until a TSC timestamp expires, or when packets arrive. >>> >>> This API mandates a core-to-single-queue mapping (that is, multiple >>> queued per device are supported, but they have to be polled on different >>> cores). >>> >>> This design is using PMD RX callbacks. >>> >>> 1. UMWAIT/UMONITOR: >>> >>> When a certain threshold of empty polls is reached, the core will go >>> into a power optimized sleep while waiting on an address of next RX >>> descriptor to be written to. >>> >>> 2. 
TPAUSE/Pause instruction >>> >>> This method uses the pause (or TPAUSE, if available) instruction to >>> avoid busy polling. >>> >>> 3. Frequency scaling >>> Reuse existing DPDK power library to scale up/down core frequency >>> depending on traffic volume. >>> >>> Signed-off-by: Liang Ma <liang.j.ma@intel.com> >>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> >>> --- >>> >>> Notes: >>> v15: >>> - Fix check in UMWAIT callback >>> >>> v13: >>> - Rework the synchronization mechanism to not require locking >>> - Add more parameter checking >>> - Rework n_rx_queues access to not go through internal PMD >>> structures and use >>> public API instead >>> >>> v13: >>> - Rework the synchronization mechanism to not require locking >>> - Add more parameter checking >>> - Rework n_rx_queues access to not go through internal PMD >>> structures and use >>> public API instead >>> >>> doc/guides/prog_guide/power_man.rst | 44 +++ >>> doc/guides/rel_notes/release_21_02.rst | 10 + >>> lib/librte_power/meson.build | 5 +- >>> lib/librte_power/rte_power_pmd_mgmt.c | 359 +++++++++++++++++++++++++ >>> lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++++ >>> lib/librte_power/version.map | 5 + >>> 6 files changed, 511 insertions(+), 2 deletions(-) >>> create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c >>> create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h >>> >> >> ... 
>> >>> + >>> +static uint16_t >>> +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts >>> __rte_unused, >>> +uint16_t nb_rx, uint16_t max_pkts __rte_unused, >>> +void *addr __rte_unused) >>> +{ >>> + >>> +struct pmd_queue_cfg *q_conf; >>> + >>> +q_conf = &port_cfg[port_id][qidx]; >>> + >>> +if (unlikely(nb_rx == 0)) { >>> +q_conf->empty_poll_stats++; >>> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { >>> +struct rte_power_monitor_cond pmc; >>> +uint16_t ret; >>> + >>> +/* >>> + * we might get a cancellation request while being >>> + * inside the callback, in which case the wakeup >>> + * wouldn't work because it would've arrived too early. >>> + * >>> + * to get around this, we notify the other thread that >>> + * we're sleeping, so that it can spin until we're done. >>> + * unsolicited wakeups are perfectly safe. >>> + */ >>> +q_conf->umwait_in_progress = true; >> >> This write and subsequent read can be reordered by the cpu. >> I think you need rte_atomic_thread_fence(__ATOMIC_SEQ_CST) here and >> in disable() code-path below. >> >>> + >>> +/* check if we need to cancel sleep */ >>> +if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { >>> +/* use monitoring condition to sleep */ >>> +ret = rte_eth_get_monitor_addr(port_id, qidx, >>> +&pmc); >>> +if (ret == 0) >>> +rte_power_monitor(&pmc, -1ULL); >>> +} >>> +q_conf->umwait_in_progress = false; >>> +} >>> +} else >>> +q_conf->empty_poll_stats = 0; >>> + >>> +return nb_rx; >>> +} >>> + >> >> ... 
>> >>> + >>> +int >>> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id, >>> +uint16_t port_id, uint16_t queue_id) >>> +{ >>> +struct pmd_queue_cfg *queue_cfg; >>> + >>> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); >>> + >>> +if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) >>> +return -EINVAL; >>> + >>> +/* no need to check queue id as wrong queue id would not be enabled */ >>> +queue_cfg = &port_cfg[port_id][queue_id]; >>> + >>> +if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) >>> +return -EINVAL; >>> + >>> +/* let the callback know we're shutting down */ >>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY; >> >> Same as above - write to pwr_mgmt_state and read from umwait_in_progress >> could be reordered by cpu. >> Need to insert rte_atomic_thread_fence(__ATOMIC_SEQ_CST) between them. >> >> BTW, out of curiosity - why do you need this intermediate >> state (PMD_MGMT_BUSY) at all? >> Why not directly: >> queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; >> ? >> > > Thanks for suggestions, i'll add those. > > The goal for the "intermediate" step is to prevent Rx callback from > sleeping in the first place. We can't "wake up" earlier than it goes to > sleep, but we may get a request to disable power management while we're > at the beginning of the callback and haven't yet entered the > rte_power_monitor code. > > In this case, setting it to "BUSY" will prevent the callback from ever > sleeping in the first place (see rte_power_pmd_mgmt:108 check), and will > unset the "umwait in progress" if there was any. > > So, we have three situations to handle: > > 1) wake up during umwait > 2) "wake up" during callback after we've set the "umwait in progress" > flag but before actual umwait happens - we don't wait to exit before > we're sure there's nothing sleeping there > 3) "wake up" during callback before we set the "umwait in progress" flag > > 1) is handled by the rte_power_monitor_wakeup() call, so that's taken > care of. 
2) is handled by the other thread waiting on "umwait in
> progress" becoming false. 3) is handled by having this BUSY check in the
> umwait thread.
>
> Hope that made sense!
>

On further thought, the "BUSY" state relies on a hidden assumption that enabling/disabling power management per queue is supposed to be thread safe. If we let go of this assumption, we can get by with just enable/disable, so I think I'll just document the thread safety and leave out the "BUSY" part.

--
Thanks,
Anatoly

^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v16 08/11] net/ixgbe: implement power management API 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (6 preceding siblings ...) 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 09/11] net/i40e: " Anatoly Burakov ` (4 subsequent siblings) 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Liang Ma, Jeff Guo, Haiyue Wang, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Liang Ma <liang.j.ma@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ixgbe/ixgbe_ethdev.c | 1 + drivers/net/ixgbe/ixgbe_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/ixgbe/ixgbe_rxtx.h | 1 + 3 files changed, 27 insertions(+) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index d7a1806ab8..97acf35d24 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -560,6 +560,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = { .udp_tunnel_port_del = ixgbe_dev_udp_tunnel_port_del, .tm_ops_get = ixgbe_tm_ops_get, .tx_done_cleanup = ixgbe_dev_tx_done_cleanup, + .get_monitor_addr = ixgbe_get_monitor_addr, }; /* diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index 7bb8460359..cc8f70e6dd 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -1369,6 +1369,31 @@ const uint32_t RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP, }; +int +ixgbe_get_monitor_addr(void *rx_queue, 
struct rte_power_monitor_cond *pmc) +{ + volatile union ixgbe_adv_rx_desc *rxdp; + struct ixgbe_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.upper.status_error; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + + /* the registers are 32-bit */ + pmc->data_sz = sizeof(uint32_t); + + return 0; +} + /* @note: fix ixgbe_dev_supported_ptypes_get() if any change here. */ static inline uint32_t ixgbe_rxd_pkt_info_to_pkt_type(uint32_t pkt_info, uint16_t ptype_mask) diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h index 6d2f7c9da3..8a25e98df6 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.h +++ b/drivers/net/ixgbe/ixgbe_rxtx.h @@ -299,5 +299,6 @@ uint64_t ixgbe_get_tx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_queue_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_rx_port_offloads(struct rte_eth_dev *dev); uint64_t ixgbe_get_tx_queue_offloads(struct rte_eth_dev *dev); +int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); #endif /* _IXGBE_RXTX_H_ */ -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
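How `rte_power_monitor()` consumes the fields filled in by `get_monitor_addr` is not shown in this part of the thread. As a rough sketch, assuming the sleep is skipped when the masked value at `addr` already equals `val` (which is what the "DD bit ... already written to" comment in the driver suggests), the check could look like the following. `struct monitor_cond`, `read_monitored()` and `cond_already_met()` are hypothetical names, and byte-order conversion (`rte_cpu_to_le_*`) is left out, so values are in host order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the fields the driver fills in. */
struct monitor_cond {
	const volatile void *addr; /* address to watch */
	uint64_t val;              /* expected value after masking */
	uint64_t mask;             /* bits of interest */
	uint8_t data_sz;           /* width of the watched field, in bytes */
};

/* Read data_sz bytes at addr and widen the result to 64 bits. */
static uint64_t
read_monitored(const volatile void *addr, uint8_t data_sz)
{
	switch (data_sz) {
	case sizeof(uint8_t):
		return *(const volatile uint8_t *)addr;
	case sizeof(uint16_t):
		return *(const volatile uint16_t *)addr;
	case sizeof(uint32_t):
		return *(const volatile uint32_t *)addr;
	case sizeof(uint64_t):
		return *(const volatile uint64_t *)addr;
	default:
		/* unsupported width: report "nothing seen" */
		return 0;
	}
}

/* True if the descriptor was already written, so sleeping is pointless. */
static bool
cond_already_met(const struct monitor_cond *c)
{
	if (c->mask == 0)
		return false;
	return (read_monitored(c->addr, c->data_sz) & c->mask) == c->val;
}
```

This is why each driver reports a different `data_sz`: the watched status field is 32 bits wide in ixgbe, 64 bits in i40e and 16 bits in ice, and the read has to match that width.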
* [dpdk-dev] [PATCH v16 09/11] net/i40e: implement power management API 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (7 preceding siblings ...) 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 08/11] net/ixgbe: implement power management API Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 10/11] net/ice: " Anatoly Burakov ` (3 subsequent siblings) 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Liang Ma, Beilei Xing, Jeff Guo, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Acked-by: Jeff Guo <jia.guo@intel.com> --- drivers/net/i40e/i40e_ethdev.c | 1 + drivers/net/i40e/i40e_rxtx.c | 25 +++++++++++++++++++++++++ drivers/net/i40e/i40e_rxtx.h | 1 + 3 files changed, 27 insertions(+) diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c index 14622484a0..ba1abc584f 100644 --- a/drivers/net/i40e/i40e_ethdev.c +++ b/drivers/net/i40e/i40e_ethdev.c @@ -510,6 +510,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = { .mtu_set = i40e_dev_mtu_set, .tm_ops_get = i40e_tm_ops_get, .tx_done_cleanup = i40e_tx_done_cleanup, + .get_monitor_addr = i40e_get_monitor_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index 5df9a9df56..0b4220fc9c 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -72,6 +72,31 @@ #define I40E_TX_OFFLOAD_NOTSUP_MASK \ (PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK) 
+int +i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + struct i40e_rx_queue *rxq = rx_queue; + volatile union i40e_rx_desc *rxdp; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.qword1.status_error_len; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + + /* registers are 64-bit */ + pmc->data_sz = sizeof(uint64_t); + + return 0; +} + static inline void i40e_rxd_to_vlan_tci(struct rte_mbuf *mb, volatile union i40e_rx_desc *rxdp) { diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h index 57d7b4160b..e1494525ce 100644 --- a/drivers/net/i40e/i40e_rxtx.h +++ b/drivers/net/i40e/i40e_rxtx.h @@ -248,6 +248,7 @@ uint16_t i40e_recv_scattered_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts); uint16_t i40e_xmit_pkts_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); +int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); /* For each value it means, datasheet of hardware can tell more details * -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v16 10/11] net/ice: implement power management API 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (8 preceding siblings ...) 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 09/11] net/i40e: " Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov ` (2 subsequent siblings) 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Liang Ma, Qiming Yang, Qi Zhang, thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Implement support for the power management API by implementing a `get_monitor_addr` function that will return an address of an RX ring's status bit. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- drivers/net/ice/ice_ethdev.c | 1 + drivers/net/ice/ice_rxtx.c | 26 ++++++++++++++++++++++++++ drivers/net/ice/ice_rxtx.h | 1 + 3 files changed, 28 insertions(+) diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c index 587f485ee3..38c6263946 100644 --- a/drivers/net/ice/ice_ethdev.c +++ b/drivers/net/ice/ice_ethdev.c @@ -216,6 +216,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = { .udp_tunnel_port_add = ice_dev_udp_tunnel_port_add, .udp_tunnel_port_del = ice_dev_udp_tunnel_port_del, .tx_done_cleanup = ice_tx_done_cleanup, + .get_monitor_addr = ice_get_monitor_addr, }; /* store statistics names and its offset in stats structure */ diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index d052bd0f1b..066651dc48 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -26,6 +26,32 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask; uint64_t 
rte_net_ice_dynflag_proto_xtr_tcp_mask; uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask; +int +ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + volatile union ice_rx_flex_desc *rxdp; + struct ice_rx_queue *rxq = rx_queue; + uint16_t desc; + + desc = rxq->rx_tail; + rxdp = &rxq->rx_ring[desc]; + /* watch for changes in status bit */ + pmc->addr = &rxdp->wb.status_error0; + + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + + /* register is 16-bit */ + pmc->data_sz = sizeof(uint16_t); + + return 0; +} + + static inline uint8_t ice_proto_xtr_type_to_rxdid(uint8_t xtr_type) { diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h index 6b16716063..906fbefdc4 100644 --- a/drivers/net/ice/ice_rxtx.h +++ b/drivers/net/ice/ice_rxtx.h @@ -263,6 +263,7 @@ uint16_t ice_xmit_pkts_vec_avx512(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts); int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc); int ice_tx_done_cleanup(void *txq, uint32_t free_cnt); +int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc); #define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \ int i; \ -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* [dpdk-dev] [PATCH v16 11/11] examples/l3fwd-power: enable PMD power mgmt 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (9 preceding siblings ...) 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 10/11] net/ice: " Anatoly Burakov @ 2021-01-12 17:37 ` Anatoly Burakov 2021-01-14 9:36 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management David Marchand 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 " Anatoly Burakov 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-12 17:37 UTC (permalink / raw) To: dev Cc: Liang Ma, David Hunt, thomas, konstantin.ananyev, timothy.mcdaniel, bruce.richardson, chris.macnamara From: Liang Ma <liang.j.ma@intel.com> Add PMD power management feature support to l3fwd-power sample app. Signed-off-by: Liang Ma <liang.j.ma@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> --- Notes: v12: - Allow selecting PMD power management scheme from command-line - Enforce 1 core 1 queue rule .../sample_app_ug/l3_forward_power_man.rst | 35 ++++++++ examples/l3fwd-power/main.c | 89 ++++++++++++++++++- 2 files changed, 122 insertions(+), 2 deletions(-) diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst index 85a78a5c1e..aaa9367fae 100644 --- a/doc/guides/sample_app_ug/l3_forward_power_man.rst +++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst @@ -109,6 +109,8 @@ where, * --telemetry: Telemetry mode. +* --pmd-mgmt: PMD power management mode. + See :doc:`l3_forward` for details. The L3fwd-power example reuses the L3fwd command line options. @@ -456,3 +458,36 @@ reference cycles and accordingly busy rate is set to either 0% or The new stats ``empty_poll`` , ``full_poll`` and ``busy_percent`` can be viewed by running the script ``/usertools/dpdk-telemetry-client.py`` and selecting the menu option ``Send for global Metrics``. 
+ +PMD power management Mode +------------------------- + +The PMD power management mode support for ``l3fwd-power`` is a standalone mode, in this mode +``l3fwd-power`` does simple l3fwding along with enable the power saving scheme on specific +port/queue/lcore. Main purpose for this mode is to demonstrate how to use the PMD power management API. + +.. code-block:: console + + ./build/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)" + +PMD Power Management Mode +------------------------- +There is also a traffic-aware operating mode that, instead of using explicit +power management, will use automatic PMD power management. This mode is limited +to one queue per core, and has three available power management schemes: + +* ``monitor`` - this will use ``rte_power_monitor()`` function to enter a + power-optimized state (subject to platform support). + +* ``pause`` - this will use ``rte_power_pause()`` or ``rte_pause()`` to avoid + busy looping when there is no traffic. + +* ``scale`` - this will use frequency scaling routines available in the + ``librte_power`` library. + +See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK +Programmer's Guide for more details on PMD power management. + +.. 
code-block:: console + + ./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c index 995a3b6ad7..e312b6f355 100644 --- a/examples/l3fwd-power/main.c +++ b/examples/l3fwd-power/main.c @@ -47,6 +47,7 @@ #include <rte_power_empty_poll.h> #include <rte_metrics.h> #include <rte_telemetry.h> +#include <rte_power_pmd_mgmt.h> #include "perf_core.h" #include "main.h" @@ -199,11 +200,14 @@ enum appmode { APP_MODE_LEGACY, APP_MODE_EMPTY_POLL, APP_MODE_TELEMETRY, - APP_MODE_INTERRUPT + APP_MODE_INTERRUPT, + APP_MODE_PMD_MGMT }; enum appmode app_mode; +static enum rte_power_pmd_mgmt_type pmgmt_type; + enum freq_scale_hint_t { FREQ_LOWER = -1, @@ -1611,7 +1615,9 @@ print_usage(const char *prgname) " follow (training_flag, high_threshold, med_threshold)\n" " --telemetry: enable telemetry mode, to update" " empty polls, full polls, and core busyness to telemetry\n" - " --interrupt-only: enable interrupt-only mode\n", + " --interrupt-only: enable interrupt-only mode\n" + " --pmd-mgmt MODE: enable PMD power management mode. 
" + "Currently supported modes: monitor, pause, scale\n", prgname); } @@ -1701,6 +1707,32 @@ parse_config(const char *q_arg) return 0; } + +static int +parse_pmd_mgmt_config(const char *name) +{ +#define PMD_MGMT_MONITOR "monitor" +#define PMD_MGMT_PAUSE "pause" +#define PMD_MGMT_SCALE "scale" + + if (strncmp(PMD_MGMT_MONITOR, name, sizeof(PMD_MGMT_MONITOR)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_MONITOR; + return 0; + } + + if (strncmp(PMD_MGMT_PAUSE, name, sizeof(PMD_MGMT_PAUSE)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_PAUSE; + return 0; + } + + if (strncmp(PMD_MGMT_SCALE, name, sizeof(PMD_MGMT_SCALE)) == 0) { + pmgmt_type = RTE_POWER_MGMT_TYPE_SCALE; + return 0; + } + /* unknown PMD power management mode */ + return -1; +} + static int parse_ep_config(const char *q_arg) { @@ -1755,6 +1787,7 @@ parse_ep_config(const char *q_arg) #define CMD_LINE_OPT_EMPTY_POLL "empty-poll" #define CMD_LINE_OPT_INTERRUPT_ONLY "interrupt-only" #define CMD_LINE_OPT_TELEMETRY "telemetry" +#define CMD_LINE_OPT_PMD_MGMT "pmd-mgmt" /* Parse the argument given in the command line of the application */ static int @@ -1776,6 +1809,7 @@ parse_args(int argc, char **argv) {CMD_LINE_OPT_LEGACY, 0, 0, 0}, {CMD_LINE_OPT_TELEMETRY, 0, 0, 0}, {CMD_LINE_OPT_INTERRUPT_ONLY, 0, 0, 0}, + {CMD_LINE_OPT_PMD_MGMT, 1, 0, 0}, {NULL, 0, 0, 0} }; @@ -1886,6 +1920,21 @@ parse_args(int argc, char **argv) printf("telemetry mode is enabled\n"); } + if (!strncmp(lgopts[option_index].name, + CMD_LINE_OPT_PMD_MGMT, + sizeof(CMD_LINE_OPT_PMD_MGMT))) { + if (app_mode != APP_MODE_DEFAULT) { + printf(" power mgmt mode is mutually exclusive with other modes\n"); + return -1; + } + if (parse_pmd_mgmt_config(optarg) < 0) { + printf(" Invalid PMD power management mode: %s\n", + optarg); + return -1; + } + app_mode = APP_MODE_PMD_MGMT; + printf("PMD power mgmt mode is enabled\n"); + } if (!strncmp(lgopts[option_index].name, CMD_LINE_OPT_INTERRUPT_ONLY, sizeof(CMD_LINE_OPT_INTERRUPT_ONLY))) { @@ -2442,6 +2491,8 @@ 
mode_to_str(enum appmode mode) return "telemetry"; case APP_MODE_INTERRUPT: return "interrupt-only"; + case APP_MODE_PMD_MGMT: + return "pmd mgmt"; default: return "invalid"; } @@ -2671,6 +2722,13 @@ main(int argc, char **argv) qconf = &lcore_conf[lcore_id]; printf("\nInitializing rx queues on lcore %u ... ", lcore_id ); fflush(stdout); + + /* PMD power management mode can only do 1 queue per core */ + if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) { + rte_exit(EXIT_FAILURE, + "In PMD power management mode, only one queue per lcore is allowed\n"); + } + /* init RX queues */ for(queue = 0; queue < qconf->n_rx_queue; ++queue) { struct rte_eth_rxconf rxq_conf; @@ -2708,6 +2766,16 @@ main(int argc, char **argv) rte_exit(EXIT_FAILURE, "Fail to add ptype cb\n"); } + + if (app_mode == APP_MODE_PMD_MGMT) { + ret = rte_power_pmd_mgmt_queue_enable( + lcore_id, portid, queueid, + pmgmt_type); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_power_pmd_mgmt_queue_enable: err=%d, port=%d\n", + ret, portid); + } } } @@ -2798,6 +2866,9 @@ main(int argc, char **argv) SKIP_MAIN); } else if (app_mode == APP_MODE_INTERRUPT) { rte_eal_mp_remote_launch(main_intr_loop, NULL, CALL_MAIN); + } else if (app_mode == APP_MODE_PMD_MGMT) { + /* reuse telemetry loop for PMD power management mode */ + rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN); } if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY) @@ -2824,6 +2895,20 @@ main(int argc, char **argv) if (app_mode == APP_MODE_EMPTY_POLL) rte_power_empty_poll_stat_free(); + if (app_mode == APP_MODE_PMD_MGMT) { + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { + if (rte_lcore_is_enabled(lcore_id) == 0) + continue; + qconf = &lcore_conf[lcore_id]; + for (queue = 0; queue < qconf->n_rx_queue; ++queue) { + portid = qconf->rx_queue_list[queue].port_id; + queueid = qconf->rx_queue_list[queue].queue_id; + rte_power_pmd_mgmt_queue_disable(lcore_id, + portid, queueid); + } + } + } + if ((app_mode == 
APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) && deinit_power_library()) rte_exit(EXIT_FAILURE, "deinit_power_library failed\n"); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v16 00/11] Add PMD power management 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (10 preceding siblings ...) 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 11/11] examples/l3fwd-power: enable PMD power mgmt Anatoly Burakov @ 2021-01-14 9:36 ` David Marchand 2021-01-14 10:25 ` Burakov, Anatoly 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 " Anatoly Burakov 12 siblings, 1 reply; 421+ messages in thread From: David Marchand @ 2021-01-14 9:36 UTC (permalink / raw) To: Anatoly Burakov, Ray Kinsella Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara, Kevin Traynor On Tue, Jan 12, 2021 at 6:37 PM Anatoly Burakov <anatoly.burakov@intel.com> wrote: > > This patchset proposes a simple API for Ethernet drivers to cause the > CPU to enter a power-optimized state while waiting for packets to > arrive. There are multiple proposed mechanisms to achieve said power > savings: simple frequency scaling, idle loop, and monitoring the Rx > queue for incoming packages. The latter is achieved through cooperation > with the NIC driver that will allow us to know address of wake up event, > and wait for writes on that address. > > On IA, this is achieved through using UMONITOR/UMWAIT instructions. They > are used in their raw opcode form because there is no widespread > compiler support for them yet. Still, the API is made generic enough to > hopefully support other architectures, if they happen to implement > similar instructions. > > To achieve power savings, there is a very simple mechanism used: we're > counting empty polls, and if a certain threshold is reached, we employ > one of the suggested power management schemes automatically, from within > a Rx callback inside the PMD. Once there's traffic again, the empty poll > counter is reset. 
>
> This patchset also introduces a few changes into existing power
> management-related intrinsics, namely to provide a native way of waking
> up a sleeping core without application being responsible for it, as well
> as general robustness improvements. There's quite a bit of locking going
> on, but these locks are per-thread and very little (if any) contention
> is expected, so the performance impact shouldn't be that bad (and in any
> case the locking happens when we're about to sleep anyway).
>
> Why are we putting it into ethdev as opposed to leaving this up to the
> application? Our customers specifically requested a way to do it with
> minimal changes to the application code. The current approach allows to
> just flip a switch and automatically have power savings.
>
> Things of note:
>
> - Only 1:1 core to queue mapping is supported, meaning that each lcore
> must at most handle RX on a single queue

If we want to save power, it is likely we would poll more rxqs on a thread.

> - Support 3 type policies. Monitor/Pause/Frequency Scaling
> - Power management is enabled per-queue
> - The API doesn't extend to other device types
>
> v16:
> - Implemented Konstantin's suggestions and comments
> - Added return values to the API

- This revision breaks SPDK build (reported by UNH):
http://mails.dpdk.org/archives/test-report/2021-January/174069.html

- Build is broken for ARM and PPC at patch:
86491d5bd4 - (HEAD) eal: add monitor wakeup function (25 minutes ago) <Anatoly Burakov>

Only pasting the ARM failure:
ninja: Entering directory `/home/dmarchan/builds/build-arm64-host-clang'
[1/297] Compiling C object 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'.
FAILED: lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o
aarch64-linux-gnu-gcc -Ilib/76b5a35@@rte_eal@sta -Ilib -I../../dpdk/lib -I.
-I../../dpdk/ -Iconfig -I../../dpdk/config -Ilib/librte_eal/include -I../../dpdk/lib/librte_eal/include -Ilib/librte_eal/linux/include -I../../dpdk/lib/librte_eal/linux/include -Ilib/librte_eal/arm/include -I../../dpdk/lib/librte_eal/arm/include -Ilib/librte_eal/common -I../../dpdk/lib/librte_eal/common -Ilib/librte_eal -I../../dpdk/lib/librte_eal -Ilib/librte_kvargs -I../../dpdk/lib/librte_kvargs -Ilib/librte_telemetry/../librte_metrics -I../../dpdk/lib/librte_telemetry/../librte_metrics -Ilib/librte_telemetry -I../../dpdk/lib/librte_telemetry -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -O2 -g -include rte_config.h -Wextra -Wcast-qual -Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpointer-arith -Wsign-compare -Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned -Wno-missing-field-initializers -D_GNU_SOURCE -fPIC -march=armv8-a+crc -DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API -Wno-format-truncation '-DABI_VERSION="21.1"' -DRTE_LIBEAL_USE_GETENTROPY -MD -MQ 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o' -MF 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o.d' -o 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o' -c ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:35:1: error: conflicting types for ‘rte_power_monitor_wakeup’ rte_power_monitor_wakeup(const unsigned int lcore_id) ^~~~~~~~~~~~~~~~~~~~~~~~ In file included from ../../dpdk/lib/librte_eal/arm/include/rte_power_intrinsics.h:14, from ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:5: ../../dpdk/lib/librte_eal/include/generic/rte_power_intrinsics.h:79:5: note: previous declaration of ‘rte_power_monitor_wakeup’ was here int rte_power_monitor_wakeup(const unsigned int lcore_id); ^~~~~~~~~~~~~~~~~~~~~~~~ ninja: build stopped: subcommand failed. 
- The ABI check is still not happy as I reported earlier. Reproduced on v16 (GHA had a hiccup on this revision, but previous ones had the failure too): 1 Changed variable: [C] 'rte_eth_dev rte_eth_devices[]' was changed at rte_ethdev_core.h:196:1: type of variable changed: array element type 'struct rte_eth_dev' changed: type size hasn't changed 1 data member change: type of 'const eth_dev_ops* rte_eth_dev::dev_ops' changed: in pointed to type 'const eth_dev_ops': in unqualified underlying type 'struct eth_dev_ops' at rte_ethdev_driver.h:789:1: type size changed from 6208 to 6272 (in bits) 1 data member insertion: 'eth_get_monitor_addr_t eth_dev_ops::get_monitor_addr', at offset 6208 (in bits) at rte_ethdev_driver.h:940:1 no data member changes (94 filtered); type size hasn't changed Error: ABI issue reported for 'abidiff --suppr /home/dmarchan/dpdk/devtools/../devtools/libabigail.abignore --no-added-syms --headers-dir1 /home/dmarchan/abi/v20.11/build-gcc-static/usr/local/include --headers-dir2 /home/dmarchan/builds/build-gcc-static/install/usr/local/include /home/dmarchan/abi/v20.11/build-gcc-static/dump/librte_ethdev.dump /home/dmarchan/builds/build-gcc-static/install/dump/librte_ethdev.dump' ABIDIFF_ABI_CHANGE, this change requires a review (abidiff flagged this as a potential issue). One solution is to add an exception on the eth_dev_ops structure. --- a/devtools/libabigail.abignore +++ b/devtools/libabigail.abignore @@ -7,3 +7,7 @@ symbol_version = INTERNAL [suppress_variable] symbol_version = INTERNAL + +; Explicit ignore for driver-only ABI +[suppress_type] + name = eth_dev_ops -- David marchand ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v16 00/11] Add PMD power management 2021-01-14 9:36 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management David Marchand @ 2021-01-14 10:25 ` Burakov, Anatoly 0 siblings, 0 replies; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-14 10:25 UTC (permalink / raw) To: David Marchand, Ray Kinsella Cc: dev, Thomas Monjalon, Ananyev, Konstantin, Timothy McDaniel, David Hunt, Bruce Richardson, chris.macnamara, Kevin Traynor On 14-Jan-21 9:36 AM, David Marchand wrote: > On Tue, Jan 12, 2021 at 6:37 PM Anatoly Burakov > <anatoly.burakov@intel.com> wrote: >> >> This patchset proposes a simple API for Ethernet drivers to cause the >> CPU to enter a power-optimized state while waiting for packets to >> arrive. There are multiple proposed mechanisms to achieve said power >> savings: simple frequency scaling, idle loop, and monitoring the Rx >> queue for incoming packages. The latter is achieved through cooperation >> with the NIC driver that will allow us to know address of wake up event, >> and wait for writes on that address. >> >> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They >> are used in their raw opcode form because there is no widespread >> compiler support for them yet. Still, the API is made generic enough to >> hopefully support other architectures, if they happen to implement >> similar instructions. >> >> To achieve power savings, there is a very simple mechanism used: we're >> counting empty polls, and if a certain threshold is reached, we employ >> one of the suggested power management schemes automatically, from within >> a Rx callback inside the PMD. Once there's traffic again, the empty poll >> counter is reset. >> >> This patchset also introduces a few changes into existing power >> management-related intrinsics, namely to provide a native way of waking >> up a sleeping core without application being responsible for it, as well >> as general robustness improvements. 
There's quite a bit of locking going >> on, but these locks are per-thread and very little (if any) contention >> is expected, so the performance impact shouldn't be that bad (and in any >> case the locking happens when we're about to sleep anyway). >> >> Why are we putting it into ethdev as opposed to leaving this up to the >> application? Our customers specifically requested a way to do it with >> minimal changes to the application code. The current approach allows to >> just flip a switch and automatically have power savings. >> >> Things of note: >> >> - Only 1:1 core to queue mapping is supported, meaning that each lcore >> must at most handle RX on a single queue > > If we want to save power, it is likely we would poll more rxqs on a thread. We are investigating possibilities to make that happen, but for this patchset, this is the limitation. > > >> - Support 3 type policies. Monitor/Pause/Frequency Scaling >> - Power management is enabled per-queue >> - The API doesn't extend to other device types >> >> v16: >> - Implemented Konstantin's suggestions and comments >> - Added return values to the API > > - This revision breaks SPDK build (reported by UNH): > http://mails.dpdk.org/archives/test-report/2021-January/174069.html > > > - Build is broken for ARM and PPC at patch: > 86491d5bd4 - (HEAD) eal: add monitor wakeup function (25 minutes ago) > <Anatoly Burakov> > > Only pasting the ARM failure: > ninja: Entering directory `/home/dmarchan/builds/build-arm64-host-clang' > [1/297] Compiling C object > 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'. > FAILED: lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o > aarch64-linux-gnu-gcc -Ilib/76b5a35@@rte_eal@sta -Ilib > -I../../dpdk/lib -I. 
-I../../dpdk/ -Iconfig -I../../dpdk/config > -Ilib/librte_eal/include -I../../dpdk/lib/librte_eal/include > -Ilib/librte_eal/linux/include > -I../../dpdk/lib/librte_eal/linux/include -Ilib/librte_eal/arm/include > -I../../dpdk/lib/librte_eal/arm/include -Ilib/librte_eal/common > -I../../dpdk/lib/librte_eal/common -Ilib/librte_eal > -I../../dpdk/lib/librte_eal -Ilib/librte_kvargs > -I../../dpdk/lib/librte_kvargs > -Ilib/librte_telemetry/../librte_metrics > -I../../dpdk/lib/librte_telemetry/../librte_metrics > -Ilib/librte_telemetry -I../../dpdk/lib/librte_telemetry > -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall > -Winvalid-pch -Werror -O2 -g -include rte_config.h -Wextra -Wcast-qual > -Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security > -Wmissing-declarations -Wmissing-prototypes -Wnested-externs > -Wold-style-definition -Wpointer-arith -Wsign-compare > -Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned > -Wno-missing-field-initializers -D_GNU_SOURCE -fPIC -march=armv8-a+crc > -DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API -Wno-format-truncation > '-DABI_VERSION="21.1"' -DRTE_LIBEAL_USE_GETENTROPY -MD -MQ > 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o' -MF > 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o.d' > -o 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o' > -c ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c > ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:35:1: error: > conflicting types for ‘rte_power_monitor_wakeup’ > rte_power_monitor_wakeup(const unsigned int lcore_id) > ^~~~~~~~~~~~~~~~~~~~~~~~ > In file included from > ../../dpdk/lib/librte_eal/arm/include/rte_power_intrinsics.h:14, > from ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:5: > ../../dpdk/lib/librte_eal/include/generic/rte_power_intrinsics.h:79:5: > note: previous declaration of ‘rte_power_monitor_wakeup’ was here > int rte_power_monitor_wakeup(const unsigned int lcore_id); > 
^~~~~~~~~~~~~~~~~~~~~~~~ > ninja: build stopped: subcommand failed. Woops, wrong return value in the .c files. Will fix! > > > > - The ABI check is still not happy as I reported earlier. > Reproduced on v16 (GHA had a hiccup on this revision, but previous > ones had the failure too): > > 1 Changed variable: > > [C] 'rte_eth_dev rte_eth_devices[]' was changed at rte_ethdev_core.h:196:1: > type of variable changed: > array element type 'struct rte_eth_dev' changed: > type size hasn't changed > 1 data member change: > type of 'const eth_dev_ops* rte_eth_dev::dev_ops' changed: > in pointed to type 'const eth_dev_ops': > in unqualified underlying type 'struct eth_dev_ops' at > rte_ethdev_driver.h:789:1: > type size changed from 6208 to 6272 (in bits) > 1 data member insertion: > 'eth_get_monitor_addr_t > eth_dev_ops::get_monitor_addr', at offset 6208 (in bits) at > rte_ethdev_driver.h:940:1 > no data member changes (94 filtered); > type size hasn't changed > > Error: ABI issue reported for 'abidiff --suppr > /home/dmarchan/dpdk/devtools/../devtools/libabigail.abignore > --no-added-syms --headers-dir1 > /home/dmarchan/abi/v20.11/build-gcc-static/usr/local/include > --headers-dir2 /home/dmarchan/builds/build-gcc-static/install/usr/local/include > /home/dmarchan/abi/v20.11/build-gcc-static/dump/librte_ethdev.dump > /home/dmarchan/builds/build-gcc-static/install/dump/librte_ethdev.dump' > > ABIDIFF_ABI_CHANGE, this change requires a review (abidiff flagged > this as a potential issue). > > One solution is to add an exception on the eth_dev_ops structure. > > --- a/devtools/libabigail.abignore > +++ b/devtools/libabigail.abignore > @@ -7,3 +7,7 @@ > symbol_version = INTERNAL > [suppress_variable] > symbol_version = INTERNAL > + > +; Explicit ignore for driver-only ABI > +[suppress_type] > + name = eth_dev_ops > > Right, OK. I didn't realize an "exception" is something you actually do in code, not an ad-hoc community process type thing :) I'll add this in the next revision. 
-- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
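For readers following the ARM/PPC build failure discussed above: the compiler error comes down to a definition whose return type differs from the `int` prototype in the generic header. Below is a minimal sketch of a matching stub for an architecture without monitor support. The `example_` names, the stand-in `EXAMPLE_SET_USED` macro and the choice of `-ENOTSUP` as the "not supported" return value are assumptions made for illustration; this is not the code from the next revision.

```c
#include <errno.h>

/* Stand-in for DPDK's RTE_SET_USED(), defined here so the sketch is self-contained. */
#define EXAMPLE_SET_USED(x) ((void)(x))

/* Prototype, as a generic header might declare it (hypothetical name). */
int example_power_monitor_wakeup(const unsigned int lcore_id);

/*
 * Stub for an architecture without monitor support. The return type has to
 * match the prototype above; defining the function with a different return
 * type (for example void) is what produces "conflicting types".
 */
int
example_power_monitor_wakeup(const unsigned int lcore_id)
{
	EXAMPLE_SET_USED(lcore_id);
	return -ENOTSUP; /* assumed error code for "not supported" */
}
```

Returning a negative errno value from the stub lets callers tell "unsupported here" apart from success without any architecture-specific checks.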
* [dpdk-dev] [PATCH v17 00/11] Add PMD power management 2021-01-12 17:37 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management Anatoly Burakov ` (11 preceding siblings ...) 2021-01-14 9:36 ` [dpdk-dev] [PATCH v16 00/11] Add PMD power management David Marchand @ 2021-01-14 14:46 ` Anatoly Burakov 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics Anatoly Burakov ` (12 more replies) 12 siblings, 13 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw) To: dev Cc: thomas, konstantin.ananyev, timothy.mcdaniel, david.hunt, bruce.richardson, chris.macnamara

This patchset proposes a simple API for Ethernet drivers to cause the CPU to enter a power-optimized state while waiting for packets to arrive. There are multiple proposed mechanisms to achieve said power savings: simple frequency scaling, idle loop, and monitoring the Rx queue for incoming packets. The latter is achieved through cooperation with the NIC driver that will allow us to know the address of the wake-up event, and wait for writes on that address.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They are used in their raw opcode form because there is no widespread compiler support for them yet. Still, the API is made generic enough to hopefully support other architectures, if they happen to implement similar instructions.

To achieve power savings, there is a very simple mechanism used: we're counting empty polls, and if a certain threshold is reached, we employ one of the suggested power management schemes automatically, from within an Rx callback inside the PMD. Once there's traffic again, the empty poll counter is reset.

This patchset also introduces a few changes into existing power management-related intrinsics, namely to provide a native way of waking up a sleeping core without the application being responsible for it, as well as general robustness improvements. There's quite a bit of locking going on, but these locks are per-thread and very little (if any) contention is expected, so the performance impact shouldn't be that bad (and in any case the locking happens when we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the application? Our customers specifically requested a way to do it with minimal changes to the application code. The current approach allows users to just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore must handle RX on at most a single queue
- Support for 3 types of policies: Monitor/Pause/Frequency Scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v17:
- Added exception for ethdev driver-only ABI
- Added memory barriers for monitor/wakeup (Konstantin)
- Fixed compile issues on non-x86 platforms (hopefully!)

v16:
- Implemented Konstantin's suggestions and comments
- Added return values to the API

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid parameters better
- Fixed numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

devtools/libabigail.abignore | 3 + doc/guides/prog_guide/power_man.rst | 44 +++ doc/guides/rel_notes/release_21_02.rst |
15 + .../sample_app_ug/l3_forward_power_man.rst | 35 ++ drivers/event/dlb/dlb.c | 10 +- drivers/event/dlb2/dlb2.c | 10 +- drivers/net/i40e/i40e_ethdev.c | 1 + drivers/net/i40e/i40e_rxtx.c | 25 ++ drivers/net/i40e/i40e_rxtx.h | 1 + drivers/net/ice/ice_ethdev.c | 1 + drivers/net/ice/ice_rxtx.c | 26 ++ drivers/net/ice/ice_rxtx.h | 1 + drivers/net/ixgbe/ixgbe_ethdev.c | 1 + drivers/net/ixgbe/ixgbe_rxtx.c | 25 ++ drivers/net/ixgbe/ixgbe_rxtx.h | 1 + examples/l3fwd-power/main.c | 89 ++++- .../arm/include/rte_power_intrinsics.h | 40 -- lib/librte_eal/arm/meson.build | 1 + lib/librte_eal/arm/rte_power_intrinsics.c | 40 ++ .../include/generic/rte_power_intrinsics.h | 88 ++--- .../ppc/include/rte_power_intrinsics.h | 40 -- lib/librte_eal/ppc/meson.build | 1 + lib/librte_eal/ppc/rte_power_intrinsics.c | 40 ++ lib/librte_eal/version.map | 3 + .../x86/include/rte_power_intrinsics.h | 115 ------ lib/librte_eal/x86/meson.build | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 215 +++++++++++ lib/librte_ethdev/rte_ethdev.c | 28 ++ lib/librte_ethdev/rte_ethdev.h | 25 ++ lib/librte_ethdev/rte_ethdev_driver.h | 22 ++ lib/librte_ethdev/version.map | 3 + lib/librte_power/meson.build | 5 +- lib/librte_power/rte_power_pmd_mgmt.c | 364 ++++++++++++++++++ lib/librte_power/rte_power_pmd_mgmt.h | 90 +++++ lib/librte_power/version.map | 5 + 35 files changed, 1155 insertions(+), 259 deletions(-) create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
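The empty-poll mechanism described in the cover letter above (count empty polls, act once a threshold is crossed, reset when traffic returns) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `example_` names and the threshold value of 512 are hypothetical and are not taken from the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the value actually used by the library may differ. */
#define EXAMPLE_EMPTYPOLL_MAX 512

/* Per-queue bookkeeping (hypothetical structure). */
struct example_queue_state {
	uint64_t empty_polls; /* consecutive Rx bursts that returned no packets */
};

/*
 * Update the counter after an Rx burst and report whether the caller
 * should now apply a power-saving scheme (monitor, pause or frequency
 * scaling). Any received packet resets the counter.
 */
bool
example_should_save_power(struct example_queue_state *st, uint16_t nb_rx)
{
	if (nb_rx > 0) {
		st->empty_polls = 0;
		return false;
	}
	st->empty_polls++;
	return st->empty_polls > EXAMPLE_EMPTYPOLL_MAX;
}
```

A check like this would be called from the Rx callback with the number of packets the burst returned, so the application's polling loop itself would not need to change.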
* [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 " Anatoly Burakov @ 2021-01-14 14:46 ` Anatoly Burakov 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 02/11] eal: avoid invalid API usage in " Anatoly Burakov ` (11 subsequent siblings) 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw) To: dev Cc: Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, power intrinsics are inline functions. Make them part of the ABI so that we can have various internal data associated with them without exposing said data to the outside world. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- Notes: v14: - Fix compile issues on ARM and PPC64 by moving implementations to .c files .../arm/include/rte_power_intrinsics.h | 40 ------ lib/librte_eal/arm/meson.build | 1 + lib/librte_eal/arm/rte_power_intrinsics.c | 45 +++++++ .../include/generic/rte_power_intrinsics.h | 6 +- .../ppc/include/rte_power_intrinsics.h | 40 ------ lib/librte_eal/ppc/meson.build | 1 + lib/librte_eal/ppc/rte_power_intrinsics.c | 45 +++++++ lib/librte_eal/version.map | 3 + .../x86/include/rte_power_intrinsics.h | 115 ----------------- lib/librte_eal/x86/meson.build | 1 + lib/librte_eal/x86/rte_power_intrinsics.c | 120 ++++++++++++++++++ 11 files changed, 219 insertions(+), 198 deletions(-) create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h b/lib/librte_eal/arm/include/rte_power_intrinsics.h index a4a1bc1159..9e498e9ebf 100644 --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h +++ 
b/lib/librte_eal/arm/include/rte_power_intrinsics.h @@ -13,46 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on ARM. - */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build index d62875ebae..6ec53ea03a 100644 --- a/lib/librte_eal/arm/meson.build +++ b/lib/librte_eal/arm/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c new file mode 100644 index 0000000000..ab1f44f611 --- /dev/null +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -0,0 +1,45 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2021 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +/** + * This function is not supported on ARM. 
+ */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on ARM. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(lck); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on ARM. + */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index dd520d90fa..67977bd511 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -52,7 +52,7 @@ * to undefined result. */ __rte_experimental -static inline void rte_power_monitor(const volatile void *p, +void rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz); @@ -97,7 +97,7 @@ static inline void rte_power_monitor(const volatile void *p, * wakes up. */ __rte_experimental -static inline void rte_power_monitor_sync(const volatile void *p, +void rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck); @@ -118,6 +118,6 @@ static inline void rte_power_monitor_sync(const volatile void *p, * architecture-dependent. 
*/ __rte_experimental -static inline void rte_power_pause(const uint64_t tsc_timestamp); +void rte_power_pause(const uint64_t tsc_timestamp); #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/librte_eal/ppc/include/rte_power_intrinsics.h b/lib/librte_eal/ppc/include/rte_power_intrinsics.h index 4ed03d521f..c0e9ac279f 100644 --- a/lib/librte_eal/ppc/include/rte_power_intrinsics.h +++ b/lib/librte_eal/ppc/include/rte_power_intrinsics.h @@ -13,46 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -/** - * This function is not supported on PPC64. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on PPC64. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); - RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(lck); - RTE_SET_USED(data_sz); -} - -/** - * This function is not supported on PPC64. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - RTE_SET_USED(tsc_timestamp); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/ppc/meson.build b/lib/librte_eal/ppc/meson.build index f4b6d95c42..43c46542fb 100644 --- a/lib/librte_eal/ppc/meson.build +++ b/lib/librte_eal/ppc/meson.build @@ -7,4 +7,5 @@ sources += files( 'rte_cpuflags.c', 'rte_cycles.c', 'rte_hypervisor.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c new file mode 100644 index 0000000000..84340ca2a4 --- /dev/null +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -0,0 +1,45 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2021 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +/** + * This function is not supported on PPC64. + */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on PPC64. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + RTE_SET_USED(p); + RTE_SET_USED(expected_value); + RTE_SET_USED(value_mask); + RTE_SET_USED(tsc_timestamp); + RTE_SET_USED(lck); + RTE_SET_USED(data_sz); +} + +/** + * This function is not supported on PPC64. 
+ */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + RTE_SET_USED(tsc_timestamp); +} diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map index b1db7ec795..32eceb8869 100644 --- a/lib/librte_eal/version.map +++ b/lib/librte_eal/version.map @@ -405,6 +405,9 @@ EXPERIMENTAL { rte_vect_set_max_simd_bitwidth; # added in 21.02 + rte_power_monitor; + rte_power_monitor_sync; + rte_power_pause; rte_thread_tls_key_create; rte_thread_tls_key_delete; rte_thread_tls_value_get; diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h index c7d790c854..e4c2b87f73 100644 --- a/lib/librte_eal/x86/include/rte_power_intrinsics.h +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h @@ -13,121 +13,6 @@ extern "C" { #include "generic/rte_power_intrinsics.h" -static inline uint64_t -__rte_power_get_umwait_val(const volatile void *p, const uint8_t sz) -{ - switch (sz) { - case sizeof(uint8_t): - return *(const volatile uint8_t *)p; - case sizeof(uint16_t): - return *(const volatile uint16_t *)p; - case sizeof(uint32_t): - return *(const volatile uint32_t *)p; - case sizeof(uint64_t): - return *(const volatile uint64_t *)p; - default: - /* this is an intrinsic, so we can't have any error handling */ - RTE_ASSERT(0); - return 0; - } -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. 
- */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - -/** - * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. - * For more information about usage of these instructions, please refer to - * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. - */ -static inline void -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - /* - * we're using raw byte codes for now as only the newest compiler - * versions support this instruction natively. - */ - - /* set address for UMONITOR */ - asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" - : - : "D"(p)); - - if (value_mask) { - const uint64_t cur_value = __rte_power_get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; - - /* if the masked value is already matching, abort */ - if (masked == expected_value) - return; - } - rte_spinlock_unlock(lck); - - /* execute UMWAIT */ - asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); - - rte_spinlock_lock(lck); -} - -/** - * This function uses TPAUSE instruction and will enter C0.2 state. For more - * information about usage of this instruction, please refer to Intel(R) 64 and - * IA-32 Architectures Software Developer's Manual. 
- */ -static inline void -rte_power_pause(const uint64_t tsc_timestamp) -{ - const uint32_t tsc_l = (uint32_t)tsc_timestamp; - const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); - - /* execute TPAUSE */ - asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" - : /* ignore rflags */ - : "D"(0), /* enter C0.2 */ - "a"(tsc_l), "d"(tsc_h)); -} - #ifdef __cplusplus } #endif diff --git a/lib/librte_eal/x86/meson.build b/lib/librte_eal/x86/meson.build index e78f29002e..dfd42dee0c 100644 --- a/lib/librte_eal/x86/meson.build +++ b/lib/librte_eal/x86/meson.build @@ -8,4 +8,5 @@ sources += files( 'rte_cycles.c', 'rte_hypervisor.c', 'rte_spinlock.c', + 'rte_power_intrinsics.c', ) diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c new file mode 100644 index 0000000000..34c5fd9c3e --- /dev/null +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -0,0 +1,120 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2020 Intel Corporation + */ + +#include "rte_power_intrinsics.h" + +static inline uint64_t +__get_umwait_val(const volatile void *p, const uint8_t sz) +{ + switch (sz) { + case sizeof(uint8_t): + return *(const volatile uint8_t *)p; + case sizeof(uint16_t): + return *(const volatile uint16_t *)p; + case sizeof(uint32_t): + return *(const volatile uint32_t *)p; + case sizeof(uint64_t): + return *(const volatile uint64_t *)p; + default: + /* this is an intrinsic, so we can't have any error handling */ + RTE_ASSERT(0); + return 0; + } +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
+ */ +void +rte_power_monitor(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. + */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} + +/** + * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. + * For more information about usage of these instructions, please refer to + * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, + const uint64_t value_mask, const uint64_t tsc_timestamp, + const uint8_t data_sz, rte_spinlock_t *lck) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* + * we're using raw byte codes for now as only the newest compiler + * versions support this instruction natively. 
+ */ + + /* set address for UMONITOR */ + asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" + : + : "D"(p)); + + if (value_mask) { + const uint64_t cur_value = __get_umwait_val(p, data_sz); + const uint64_t masked = cur_value & value_mask; + + /* if the masked value is already matching, abort */ + if (masked == expected_value) + return; + } + rte_spinlock_unlock(lck); + + /* execute UMWAIT */ + asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); + + rte_spinlock_lock(lck); +} + +/** + * This function uses TPAUSE instruction and will enter C0.2 state. For more + * information about usage of this instruction, please refer to Intel(R) 64 and + * IA-32 Architectures Software Developer's Manual. + */ +void +rte_power_pause(const uint64_t tsc_timestamp) +{ + const uint32_t tsc_l = (uint32_t)tsc_timestamp; + const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* execute TPAUSE */ + asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" + : /* ignore rflags */ + : "D"(0), /* enter C0.2 */ + "a"(tsc_l), "d"(tsc_h)); +} -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
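Editorial note on the TSC operand: each intrinsic in the patch above hands the 64-bit TSC deadline to the instruction as two 32-bit halves (the "a" and "d" asm operands). The split can be sketched in portable C as below. This is only an illustration: `sketch_tsc_halves`, `sketch_split_tsc()` and `sketch_join_tsc()` are hypothetical names that do not appear in the patch, and no instruction is executed.

```c
#include <stdint.h>

/* The two 32-bit halves handed to the instruction as separate operands. */
struct sketch_tsc_halves {
	uint32_t lo; /* low 32 bits, the "a" operand in the patch */
	uint32_t hi; /* high 32 bits, the "d" operand in the patch */
};

/* Split a 64-bit TSC deadline the same way the patch computes tsc_l/tsc_h. */
struct sketch_tsc_halves
sketch_split_tsc(uint64_t tsc_timestamp)
{
	struct sketch_tsc_halves h;

	h.lo = (uint32_t)tsc_timestamp;
	h.hi = (uint32_t)(tsc_timestamp >> 32);
	return h;
}

/* Inverse operation, useful only to check that no bits were lost. */
uint64_t
sketch_join_tsc(struct sketch_tsc_halves h)
{
	return ((uint64_t)h.hi << 32) | h.lo;
}
```

Joining the halves again should give back the original deadline, which is a quick way to sanity-check the shifts and casts.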
* [dpdk-dev] [PATCH v17 02/11] eal: avoid invalid API usage in power intrinsics 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 " Anatoly Burakov 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics Anatoly Burakov @ 2021-01-14 14:46 ` Anatoly Burakov 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 03/11] eal: change API of " Anatoly Burakov ` (10 subsequent siblings) 12 siblings, 0 replies; 421+ messages in thread From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw) To: dev Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara Currently, the API documentation mandates that if the user wants to use the power management intrinsics, they need to call the `rte_cpu_get_intrinsics_support` API and check support for specific intrinsics. However, if the user does not do that, it is possible to hit an illegal instruction error, because we're using raw instruction opcodes, which may or may not be supported at runtime. Now that we have everything in a C file, we can check for support at startup and prevent the user from encountering illegal instruction errors. We also add return values to the APIs, so that callers can detect unsupported or invalid usage.
 Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- Notes: v16: - Add return values and proper error handling to the API v15: - Remove accidental whitespace changes v14: - Replace uint8_t with bool lib/librte_eal/arm/rte_power_intrinsics.c | 12 +++- .../include/generic/rte_power_intrinsics.h | 24 +++++-- lib/librte_eal/ppc/rte_power_intrinsics.c | 12 +++- lib/librte_eal/x86/rte_power_intrinsics.c | 64 +++++++++++++++++-- 4 files changed, 94 insertions(+), 18 deletions(-) diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index ab1f44f611..7e7552fa8a 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -7,7 +7,7 @@ /** * This function is not supported on ARM. */ -void +int rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz) @@ -17,12 +17,14 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, RTE_SET_USED(value_mask); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(data_sz); + + return -ENOTSUP; } /** * This function is not supported on ARM. */ -void +int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck) @@ -33,13 +35,17 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); RTE_SET_USED(data_sz); + + return -ENOTSUP; } /** * This function is not supported on ARM.
*/ -void +int rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); + + return -ENOTSUP; } diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 67977bd511..37e4ec0414 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -34,7 +34,6 @@ * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. @@ -50,9 +49,14 @@ * Data size (in bytes) that will be used to compare expected value with the * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead * to undefined result. + * + * @return + * 0 on success + * -EINVAL on invalid parameters + * -ENOTSUP if unsupported */ __rte_experimental -void rte_power_monitor(const volatile void *p, +int rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz); @@ -75,7 +79,6 @@ void rte_power_monitor(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param p * Address to monitor for changes. @@ -95,9 +98,14 @@ void rte_power_monitor(const volatile void *p, * A spinlock that must be locked before entering the function, will be * unlocked while the CPU is sleeping, and will be locked again once the CPU * wakes up. 
+ * + * @return + * 0 on success + * -EINVAL on invalid parameters + * -ENOTSUP if unsupported */ __rte_experimental -void rte_power_monitor_sync(const volatile void *p, +int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck); @@ -111,13 +119,17 @@ void rte_power_monitor_sync(const volatile void *p, * * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. - * Failing to do so may result in an illegal CPU instruction error. * * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. + * + * @return + * 0 on success + * -EINVAL on invalid parameters + * -ENOTSUP if unsupported */ __rte_experimental -void rte_power_pause(const uint64_t tsc_timestamp); +int rte_power_pause(const uint64_t tsc_timestamp); #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 84340ca2a4..929e0611b0 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -7,7 +7,7 @@ /** * This function is not supported on PPC64. */ -void +int rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz) @@ -17,12 +17,14 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, RTE_SET_USED(value_mask); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(data_sz); + + return -ENOTSUP; } /** * This function is not supported on PPC64. 
*/ -void +int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck) @@ -33,13 +35,17 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); RTE_SET_USED(data_sz); + + return -ENOTSUP; } /** * This function is not supported on PPC64. */ -void +int rte_power_pause(const uint64_t tsc_timestamp) { RTE_SET_USED(tsc_timestamp); + + return -ENOTSUP; } diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 34c5fd9c3e..2a38440bec 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -4,6 +4,8 @@ #include "rte_power_intrinsics.h" +static bool wait_supported; + static inline uint64_t __get_umwait_val(const volatile void *p, const uint8_t sz) { @@ -17,24 +19,47 @@ __get_umwait_val(const volatile void *p, const uint8_t sz) case sizeof(uint64_t): return *(const volatile uint64_t *)p; default: - /* this is an intrinsic, so we can't have any error handling */ + /* shouldn't happen */ RTE_ASSERT(0); return 0; } } +static inline int +__check_val_size(const uint8_t sz) +{ + switch (sz) { + case sizeof(uint8_t): /* fall-through */ + case sizeof(uint16_t): /* fall-through */ + case sizeof(uint32_t): /* fall-through */ + case sizeof(uint64_t): /* fall-through */ + return 0; + default: + /* unexpected size */ + return -1; + } +} + /** * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state. * For more information about usage of these instructions, please refer to * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. 
*/ -void +int rte_power_monitor(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return -ENOTSUP; + + if (__check_val_size(data_sz) < 0) + return -EINVAL; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. @@ -51,13 +76,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /* if the masked value is already matching, abort */ if (masked == expected_value) - return; + return 0; } /* execute UMWAIT */ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); + + return 0; } /** @@ -65,13 +92,21 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * For more information about usage of these instructions, please refer to * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ -void +int rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, const uint64_t value_mask, const uint64_t tsc_timestamp, const uint8_t data_sz, rte_spinlock_t *lck) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return -ENOTSUP; + + if (__check_val_size(data_sz) < 0) + return -EINVAL; + /* * we're using raw byte codes for now as only the newest compiler * versions support this instruction natively. 
@@ -88,7 +123,7 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /* if the masked value is already matching, abort */ if (masked == expected_value) - return; + return 0; } rte_spinlock_unlock(lck); @@ -99,6 +134,8 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, "a"(tsc_l), "d"(tsc_h)); rte_spinlock_lock(lck); + + return 0; } /** @@ -106,15 +143,30 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, * information about usage of this instruction, please refer to Intel(R) 64 and * IA-32 Architectures Software Developer's Manual. */ -void +int rte_power_pause(const uint64_t tsc_timestamp) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); + /* prevent user from running this instruction if it's not supported */ + if (!wait_supported) + return -ENOTSUP; + /* execute TPAUSE */ asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;" : /* ignore rflags */ : "D"(0), /* enter C0.2 */ "a"(tsc_l), "d"(tsc_h)); + + return 0; +} + +RTE_INIT(rte_power_intrinsics_init) { + struct rte_cpu_intrinsics i; + + rte_cpu_get_intrinsics_support(&i); + + if (i.power_monitor && i.power_pause) + wait_supported = 1; } -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
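Editorial note on the pattern used by this patch: support is probed once at startup, and every intrinsic then refuses with an error code instead of executing an opcode the CPU might not decode. A minimal portable sketch of that idea follows. It is not the DPDK code: `probe_wait_support()`, `sketch_power_intrinsics_init()`, `sketch_check_val_size()` and `sketch_power_pause()` are hypothetical names, the GCC/Clang constructor attribute stands in for `RTE_INIT`, and the probe is hard-wired to report "unsupported" so that the sketch is safe to run on any machine.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Cached result of the startup probe; false until the constructor runs. */
static bool wait_supported;

/* Hypothetical probe: a real one would query the CPU feature flags. */
static bool
probe_wait_support(void)
{
	return false; /* assumption: pretend the instructions are missing */
}

/* Runs before main(), similar in spirit to RTE_INIT() in the patch. */
__attribute__((constructor))
static void
sketch_power_intrinsics_init(void)
{
	wait_supported = probe_wait_support();
}

/* Accept only the operand sizes that the comparison step can read. */
int
sketch_check_val_size(uint8_t sz)
{
	switch (sz) {
	case sizeof(uint8_t):
	case sizeof(uint16_t):
	case sizeof(uint32_t):
	case sizeof(uint64_t):
		return 0;
	default:
		return -1; /* unexpected size */
	}
}

/* Refuse early rather than execute an opcode the CPU may not support. */
int
sketch_power_pause(uint64_t tsc_timestamp)
{
	(void)tsc_timestamp;

	if (!wait_supported)
		return -ENOTSUP;

	/* the raw TPAUSE opcode would be issued here */
	return 0;
}
```

With the probe reporting "unsupported", a call such as `sketch_power_pause(0)` comes back with `-ENOTSUP` and never reaches the point where the raw opcode would run, which is the behaviour the commit message describes.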
* [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 " Anatoly Burakov 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 01/11] eal: uninline power intrinsics Anatoly Burakov 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 02/11] eal: avoid invalid API usage in " Anatoly Burakov @ 2021-01-14 14:46 ` Anatoly Burakov 2021-01-18 22:26 ` Thomas Monjalon 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 04/11] eal: remove sync version of power monitor Anatoly Burakov ` (9 subsequent siblings) 12 siblings, 1 reply; 421+ messages in thread From: Anatoly Burakov @ 2021-01-14 14:46 UTC (permalink / raw) To: dev Cc: Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, thomas, david.hunt, chris.macnamara Instead of passing around pointers and integers, collect everything into struct. This makes API design around these intrinsics much easier. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> --- Notes: v16: - Add error handling drivers/event/dlb/dlb.c | 10 ++-- drivers/event/dlb2/dlb2.c | 10 ++-- lib/librte_eal/arm/rte_power_intrinsics.c | 20 +++----- .../include/generic/rte_power_intrinsics.h | 50 ++++++++----------- lib/librte_eal/ppc/rte_power_intrinsics.c | 20 +++----- lib/librte_eal/x86/rte_power_intrinsics.c | 42 +++++++++------- 6 files changed, 70 insertions(+), 82 deletions(-) diff --git a/drivers/event/dlb/dlb.c b/drivers/event/dlb/dlb.c index 0c95c4793d..d2f2026291 100644 --- a/drivers/event/dlb/dlb.c +++ b/drivers/event/dlb/dlb.c @@ -3161,6 +3161,7 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, /* Interrupts not supported by PF PMD */ return 1; } else if (dlb->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -3181,9 +3182,12 @@ dlb_dequeue_wait(struct dlb_eventdev *dlb, else expected_value = 0; - rte_power_monitor(monitor_addr, 
expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c index 86724863f2..c9a8a02278 100644 --- a/drivers/event/dlb2/dlb2.c +++ b/drivers/event/dlb2/dlb2.c @@ -2870,6 +2870,7 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, if (elapsed_ticks >= timeout) { return 1; } else if (dlb2->umwait_allowed) { + struct rte_power_monitor_cond pmc; volatile struct dlb2_dequeue_qe *cq_base; union { uint64_t raw_qe[2]; @@ -2890,9 +2891,12 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, else expected_value = 0; - rte_power_monitor(monitor_addr, expected_value, - qe_mask.raw_qe[1], timeout + start_ticks, - sizeof(uint64_t)); + pmc.addr = monitor_addr; + pmc.val = expected_value; + pmc.mask = qe_mask.raw_qe[1]; + pmc.data_sz = sizeof(uint64_t); + + rte_power_monitor(&pmc, timeout + start_ticks); DLB2_INC_STAT(ev_port->stats.traffic.rx_umonitor_umwait, 1); } else { diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c index 7e7552fa8a..5f1caaf25b 100644 --- a/lib/librte_eal/arm/rte_power_intrinsics.c +++ b/lib/librte_eal/arm/rte_power_intrinsics.c @@ -8,15 +8,11 @@ * This function is not supported on ARM. 
*/ int -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); return -ENOTSUP; } @@ -25,16 +21,12 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * This function is not supported on ARM. */ int -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); return -ENOTSUP; } diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h index 37e4ec0414..3ad53068d5 100644 --- a/lib/librte_eal/include/generic/rte_power_intrinsics.h +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h @@ -18,6 +18,18 @@ * which are architecture-dependent. */ +struct rte_power_monitor_cond { + volatile void *addr; /**< Address to monitor for changes */ + uint64_t val; /**< Before attempting the monitoring, the address + * may be read and compared against this value. + **/ + uint64_t mask; /**< 64-bit mask to extract current value from addr */ + uint8_t data_sz; /**< Data size (in bytes) that will be used to compare + * expected value with the memory address. Can be 1, + * 2, 4, or 8. Supplying any other value will lead to + * undefined result. 
*/ +}; + /** * @warning * @b EXPERIMENTAL: this API may change without prior notice @@ -35,20 +47,11 @@ * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. - * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. * @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. * * @return * 0 on success @@ -56,10 +59,8 @@ * -ENOTSUP if unsupported */ __rte_experimental -int rte_power_monitor(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz); - +int rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp); /** * @warning * @b EXPERIMENTAL: this API may change without prior notice @@ -80,20 +81,11 @@ int rte_power_monitor(const volatile void *p, * @warning It is responsibility of the user to check if this function is * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. * - * @param p - * Address to monitor for changes. - * @param expected_value - * Before attempting the monitoring, the `p` address may be read and compared - * against this value. If `value_mask` is zero, this step will be skipped. - * @param value_mask - * The 64-bit mask to use to extract current value from `p`. + * @param pmc + * The monitoring condition structure. 
* @param tsc_timestamp * Maximum TSC timestamp to wait for. Note that the wait behavior is * architecture-dependent. - * @param data_sz - * Data size (in bytes) that will be used to compare expected value with the - * memory address. Can be 1, 2, 4 or 8. Supplying any other value will lead - * to undefined result. * @param lck * A spinlock that must be locked before entering the function, will be * unlocked while the CPU is sleeping, and will be locked again once the CPU @@ -105,10 +97,8 @@ int rte_power_monitor(const volatile void *p, * -ENOTSUP if unsupported */ __rte_experimental -int rte_power_monitor_sync(const volatile void *p, - const uint64_t expected_value, const uint64_t value_mask, - const uint64_t tsc_timestamp, const uint8_t data_sz, - rte_spinlock_t *lck); +int rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck); /** * @warning diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c index 929e0611b0..5e5a1fff5a 100644 --- a/lib/librte_eal/ppc/rte_power_intrinsics.c +++ b/lib/librte_eal/ppc/rte_power_intrinsics.c @@ -8,15 +8,11 @@ * This function is not supported on PPC64. */ int -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); - RTE_SET_USED(data_sz); return -ENOTSUP; } @@ -25,16 +21,12 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * This function is not supported on PPC64. 
*/ int -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { - RTE_SET_USED(p); - RTE_SET_USED(expected_value); - RTE_SET_USED(value_mask); + RTE_SET_USED(pmc); RTE_SET_USED(tsc_timestamp); RTE_SET_USED(lck); - RTE_SET_USED(data_sz); return -ENOTSUP; } diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c index 2a38440bec..6be5c8b9f1 100644 --- a/lib/librte_eal/x86/rte_power_intrinsics.c +++ b/lib/librte_eal/x86/rte_power_intrinsics.c @@ -46,9 +46,8 @@ __check_val_size(const uint8_t sz) * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ int -rte_power_monitor(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz) +rte_power_monitor(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -57,7 +56,10 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, if (!wait_supported) return -ENOTSUP; - if (__check_val_size(data_sz) < 0) + if (pmc == NULL) + return -EINVAL; + + if (__check_val_size(pmc->data_sz) < 0) return -EINVAL; /* @@ -68,14 +70,15 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already 
matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return 0; } /* execute UMWAIT */ @@ -93,9 +96,8 @@ rte_power_monitor(const volatile void *p, const uint64_t expected_value, * Intel(R) 64 and IA-32 Architectures Software Developer's Manual. */ int -rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, - const uint64_t value_mask, const uint64_t tsc_timestamp, - const uint8_t data_sz, rte_spinlock_t *lck) +rte_power_monitor_sync(const struct rte_power_monitor_cond *pmc, + const uint64_t tsc_timestamp, rte_spinlock_t *lck) { const uint32_t tsc_l = (uint32_t)tsc_timestamp; const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); @@ -104,7 +106,10 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, if (!wait_supported) return -ENOTSUP; - if (__check_val_size(data_sz) < 0) + if (pmc == NULL || lck == NULL) + return -EINVAL; + + if (__check_val_size(pmc->data_sz) < 0) return -EINVAL; /* @@ -115,14 +120,15 @@ rte_power_monitor_sync(const volatile void *p, const uint64_t expected_value, /* set address for UMONITOR */ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;" : - : "D"(p)); + : "D"(pmc->addr)); - if (value_mask) { - const uint64_t cur_value = __get_umwait_val(p, data_sz); - const uint64_t masked = cur_value & value_mask; + if (pmc->mask) { + const uint64_t cur_value = __get_umwait_val( + pmc->addr, pmc->data_sz); + const uint64_t masked = cur_value & pmc->mask; /* if the masked value is already matching, abort */ - if (masked == expected_value) + if (masked == pmc->val) return 0; } rte_spinlock_unlock(lck); -- 2.25.1 ^ permalink raw reply [flat|nested] 421+ messages in thread
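Editorial note on the struct-based API: the patch above bundles the monitored address, expected value, mask and data size into one condition struct, and validates it before use. The sketch below shows the same shape in standalone C. `struct sketch_monitor_cond` and `sketch_read_cond()` are hypothetical stand-ins (the real type is `struct rte_power_monitor_cond`), and rejecting a NULL `addr` is an extra assumption of this sketch, not something the patch does.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fields carried by the condition struct. */
struct sketch_monitor_cond {
	volatile void *addr; /* address to monitor */
	uint64_t val;        /* value compared against the masked read */
	uint64_t mask;       /* zero means "skip the comparison" */
	uint8_t data_sz;     /* 1, 2, 4 or 8 bytes */
};

/* Read data_sz bytes from addr, widened to 64 bits; -EINVAL on bad input. */
int
sketch_read_cond(const struct sketch_monitor_cond *c, uint64_t *out)
{
	if (c == NULL || out == NULL || c->addr == NULL)
		return -EINVAL;

	switch (c->data_sz) {
	case sizeof(uint8_t):
		*out = *(const volatile uint8_t *)c->addr;
		return 0;
	case sizeof(uint16_t):
		*out = *(const volatile uint16_t *)c->addr;
		return 0;
	case sizeof(uint32_t):
		*out = *(const volatile uint32_t *)c->addr;
		return 0;
	case sizeof(uint64_t):
		*out = *(const volatile uint64_t *)c->addr;
		return 0;
	default:
		return -EINVAL; /* unsupported size */
	}
}
```

A caller fills the struct once (address, value, mask, size) and passes a pointer, much as the dlb and dlb2 drivers do with `pmc` in the diff; the deadline stays a separate argument.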
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics 2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 03/11] eal: change API of " Anatoly Burakov @ 2021-01-18 22:26 ` Thomas Monjalon 2021-01-19 10:29 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Thomas Monjalon @ 2021-01-18 22:26 UTC (permalink / raw) To: Anatoly Burakov Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara 14/01/2021 15:46, Anatoly Burakov: > +struct rte_power_monitor_cond { > + volatile void *addr; /**< Address to monitor for changes */ > + uint64_t val; /**< Before attempting the monitoring, the address > + * may be read and compared against this value. "may" be read and compared? Is there a case where there is no read and compare? > + **/ > + uint64_t mask; /**< 64-bit mask to extract current value from addr */ > + uint8_t data_sz; /**< Data size (in bytes) that will be used to compare > + * expected value with the memory address. Can be 1, > + * 2, 4, or 8. Supplying any other value will lead to > + * undefined result. */ Other parameters are not prefixed with "data_", so I think this field could be simply named "size". > +}; I understand this struct is a direct translation of what existed in 20.11 as function parameters and comments. If you agree, these comments could be addressed in a separate patch. ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics 2021-01-18 22:26 ` Thomas Monjalon @ 2021-01-19 10:29 ` Burakov, Anatoly 2021-01-19 10:42 ` Thomas Monjalon 0 siblings, 1 reply; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-19 10:29 UTC (permalink / raw) To: Thomas Monjalon Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara On 18-Jan-21 10:26 PM, Thomas Monjalon wrote: > 14/01/2021 15:46, Anatoly Burakov: >> +struct rte_power_monitor_cond { >> + volatile void *addr; /**< Address to monitor for changes */ >> + uint64_t val; /**< Before attempting the monitoring, the address >> + * may be read and compared against this value. > > "may" be read and compared? > Is there a case where there is no read and compare? Yes, if the mask is not set. > >> + **/ >> + uint64_t mask; /**< 64-bit mask to extract current value from addr */ >> + uint8_t data_sz; /**< Data size (in bytes) that will be used to compare >> + * expected value with the memory address. Can be 1, >> + * 2, 4, or 8. Supplying any other value will lead to >> + * undefined result. */ > > Other parameters are not prefixed with "data_", > so I think this field could be simply named "size". OK. > >> +}; > > I understand this struct is a direct translation of what existed > in 20.11 as function parameters and comments. > If you agree, these comments could be addressed in a separate patch. > I'll be respinning anyway, so might as well do some quick fixups. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics 2021-01-19 10:29 ` Burakov, Anatoly @ 2021-01-19 10:42 ` Thomas Monjalon 2021-01-19 11:23 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Thomas Monjalon @ 2021-01-19 10:42 UTC (permalink / raw) To: Burakov, Anatoly Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara 19/01/2021 11:29, Burakov, Anatoly: > On 18-Jan-21 10:26 PM, Thomas Monjalon wrote: > > 14/01/2021 15:46, Anatoly Burakov: > >> +struct rte_power_monitor_cond { > >> + volatile void *addr; /**< Address to monitor for changes */ > >> + uint64_t val; /**< Before attempting the monitoring, the address > >> + * may be read and compared against this value. > > > > "may" be read and compared? > > Is there a case where there is no read and compare? > > Yes, if the mask is not set. If the mask is not set, the address is "read" anyway or it is only "watched" for any change? Sorry the mechanism is really not clear to me. ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics 2021-01-19 10:42 ` Thomas Monjalon @ 2021-01-19 11:23 ` Burakov, Anatoly 2021-01-19 14:17 ` Thomas Monjalon 0 siblings, 1 reply; 421+ messages in thread From: Burakov, Anatoly @ 2021-01-19 11:23 UTC (permalink / raw) To: Thomas Monjalon Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara On 19-Jan-21 10:42 AM, Thomas Monjalon wrote: > 19/01/2021 11:29, Burakov, Anatoly: >> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote: >>> 14/01/2021 15:46, Anatoly Burakov: >>>> +struct rte_power_monitor_cond { >>>> + volatile void *addr; /**< Address to monitor for changes */ >>>> + uint64_t val; /**< Before attempting the monitoring, the address >>>> + * may be read and compared against this value. >>> >>> "may" be read and compared? >>> Is there a case where there is no read and compare? >> >> Yes, if the mask is not set. > > If the mask is not set, the address is "read" anyway > or it is only "watched" for any change? > > Sorry the mechanism is really not clear to me. > The "value" is only used to avoid the sleep, i.e. to check if the write has already happened. We're waiting on *a write* rather than *a value*, so it's not equivalent to "wait until equal" call. It's more of a "sleep until something happens". -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 421+ messages in thread
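Editorial note on the semantics discussed in this exchange: the expected value is only a pre-check that avoids sleeping when the write has already landed, and a zero mask disables the check altogether. Going by the x86 code in the patch, the decision can be sketched as below. `sketch_should_sleep()` is a hypothetical helper written for illustration; it models only the comparison and performs no monitoring or waiting.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the caller should still go to sleep after arming the
 * monitor. cur is the value just read from the monitored address.
 */
bool
sketch_should_sleep(uint64_t cur, uint64_t val, uint64_t mask)
{
	/* no mask: nothing is compared, so always wait for the next write */
	if (mask == 0)
		return true;

	/* expected value already present: the write happened, do not sleep */
	if ((cur & mask) == val)
		return false;

	/* otherwise wait for a write to the address or for the timeout */
	return true;
}
```

Note that waking up does not imply the value now matches: the wait ends on any write to the monitored address (or on timeout), so the caller still has to re-check its own condition afterwards.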
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics 2021-01-19 11:23 ` Burakov, Anatoly @ 2021-01-19 14:17 ` Thomas Monjalon 2021-01-20 10:32 ` Burakov, Anatoly 0 siblings, 1 reply; 421+ messages in thread From: Thomas Monjalon @ 2021-01-19 14:17 UTC (permalink / raw) To: Burakov, Anatoly Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara 19/01/2021 12:23, Burakov, Anatoly: > On 19-Jan-21 10:42 AM, Thomas Monjalon wrote: > > 19/01/2021 11:29, Burakov, Anatoly: > >> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote: > >>> 14/01/2021 15:46, Anatoly Burakov: > >>>> +struct rte_power_monitor_cond { > >>>> + volatile void *addr; /**< Address to monitor for changes */ > >>>> + uint64_t val; /**< Before attempting the monitoring, the address > >>>> + * may be read and compared against this value. > >>> > >>> "may" be read and compared? > >>> Is there a case where there is no read and compare? > >> > >> Yes, if the mask is not set. > > > > If the mask is not set, the address is "read" anyway > > or it is only "watched" for any change? > > > > Sorry the mechanism is really not clear to me. > > > > The "value" is only used to avoid the sleep, i.e. to check if the write > has already happened. We're waiting on *a write* rather than *a value*, > so it's not equivalent to "wait until equal" call. It's more of a "sleep > until something happens". Please make things explicit in doxygen. The behaviour of each case should be explained crystal clear. Thanks ^ permalink raw reply [flat|nested] 421+ messages in thread
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-19 14:17 ` Thomas Monjalon
@ 2021-01-20 10:32 ` Burakov, Anatoly
  2021-01-20 10:38 ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-20 10:32 UTC
To: Thomas Monjalon
Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara

On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
> 19/01/2021 12:23, Burakov, Anatoly:
>> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
>>> 19/01/2021 11:29, Burakov, Anatoly:
>>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
>>>>> 14/01/2021 15:46, Anatoly Burakov:
>>>>>> +struct rte_power_monitor_cond {
>>>>>> +    volatile void *addr; /**< Address to monitor for changes */
>>>>>> +    uint64_t val; /**< Before attempting the monitoring, the address
>>>>>> +                   *   may be read and compared against this value.
>>>>>
>>>>> "may" be read and compared?
>>>>> Is there a case where there is no read and compare?
>>>>
>>>> Yes, if the mask is not set.
>>>
>>> If the mask is not set, the address is "read" anyway
>>> or it is only "watched" for any change?
>>>
>>> Sorry the mechanism is really not clear to me.
>>>
>>
>> The "value" is only used to avoid the sleep, i.e. to check if the write
>> has already happened. We're waiting on *a write* rather than *a value*,
>> so it's not equivalent to "wait until equal" call. It's more of a "sleep
>> until something happens".
>
> Please make things explicit in doxygen.
> The behaviour of each case should be explained crystal clear.
> Thanks
>
>

It is explained in the comments to `rte_power_monitor()` call. But OK,
i'll add more clarification for the struct too.

-- 
Thanks,
Anatoly
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-20 10:32 ` Burakov, Anatoly
@ 2021-01-20 10:38 ` Thomas Monjalon
  2021-01-20 11:05 ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-20 10:38 UTC
To: Burakov, Anatoly
Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara

20/01/2021 11:32, Burakov, Anatoly:
> On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
> > 19/01/2021 12:23, Burakov, Anatoly:
> >> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
> >>> 19/01/2021 11:29, Burakov, Anatoly:
> >>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
> >>>>> 14/01/2021 15:46, Anatoly Burakov:
> >>>>>> +struct rte_power_monitor_cond {
> >>>>>> +    volatile void *addr; /**< Address to monitor for changes */
> >>>>>> +    uint64_t val; /**< Before attempting the monitoring, the address
> >>>>>> +                   *   may be read and compared against this value.
> >>>>>
> >>>>> "may" be read and compared?
> >>>>> Is there a case where there is no read and compare?
> >>>>
> >>>> Yes, if the mask is not set.
> >>>
> >>> If the mask is not set, the address is "read" anyway
> >>> or it is only "watched" for any change?
> >>>
> >>> Sorry the mechanism is really not clear to me.
> >>>
> >>
> >> The "value" is only used to avoid the sleep, i.e. to check if the write
> >> has already happened. We're waiting on *a write* rather than *a value*,
> >> so it's not equivalent to "wait until equal" call. It's more of a "sleep
> >> until something happens".
> >
> > Please make things explicit in doxygen.
> > The behaviour of each case should be explained crystal clear.
> > Thanks
>
> It is explained in the comments to `rte_power_monitor()` call. But OK,
> i'll add more clarification for the struct too.

Please avoid the word "may" in API description.

This is what is explained in rte_power_monitor:
"
 * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
 * mask is non-zero, the current value pointed to by the `p` pointer will be
 * checked against the expected value, and if they match, the entering of
 * optimized power state may be aborted.
"

Can we replace "may" by "will"?
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-20 10:38 ` Thomas Monjalon
@ 2021-01-20 11:05 ` Burakov, Anatoly
  2021-01-20 11:11 ` Thomas Monjalon
  0 siblings, 1 reply; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-20 11:05 UTC
To: Thomas Monjalon
Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara

On 20-Jan-21 10:38 AM, Thomas Monjalon wrote:
> 20/01/2021 11:32, Burakov, Anatoly:
>> On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
>>> 19/01/2021 12:23, Burakov, Anatoly:
>>>> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
>>>>> 19/01/2021 11:29, Burakov, Anatoly:
>>>>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
>>>>>>> 14/01/2021 15:46, Anatoly Burakov:
>>>>>>>> +struct rte_power_monitor_cond {
>>>>>>>> +    volatile void *addr; /**< Address to monitor for changes */
>>>>>>>> +    uint64_t val; /**< Before attempting the monitoring, the address
>>>>>>>> +                   *   may be read and compared against this value.
>>>>>>>
>>>>>>> "may" be read and compared?
>>>>>>> Is there a case where there is no read and compare?
>>>>>>
>>>>>> Yes, if the mask is not set.
>>>>>
>>>>> If the mask is not set, the address is "read" anyway
>>>>> or it is only "watched" for any change?
>>>>>
>>>>> Sorry the mechanism is really not clear to me.
>>>>>
>>>>
>>>> The "value" is only used to avoid the sleep, i.e. to check if the write
>>>> has already happened. We're waiting on *a write* rather than *a value*,
>>>> so it's not equivalent to "wait until equal" call. It's more of a "sleep
>>>> until something happens".
>>>
>>> Please make things explicit in doxygen.
>>> The behaviour of each case should be explained crystal clear.
>>> Thanks
>>
>> It is explained in the comments to `rte_power_monitor()` call. But OK,
>> i'll add more clarification for the struct too.
>
> Please avoid the word "may" in API description.
>
> This is what is explained in rte_power_monitor:
> "
>  * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
>  * mask is non-zero, the current value pointed to by the `p` pointer will be
>  * checked against the expected value, and if they match, the entering of
>  * optimized power state may be aborted.
> "
>
> Can we replace "may" by "will"?
>

Yep, we can. However, the "may" part was intended to leave some wiggle
room for a different implementation, should the need arise, and i find
"will" to be needlessly prescriptive. Frankly, i do not see the need for
such a detailed description of what the API does under the hood, as long
as it's clear what its effects are. The main purpose is waiting for a
write. The mask is only used to check whether the expected write has
already happened by the time we're calling the API. Whether the CPU then
does or does not go to sleep is not really relevant IMO.

-- 
Thanks,
Anatoly
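For readers following the "value"/mask discussion, the pre-sleep check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the struct name `monitor_cond_sketch` and the helper `write_already_happened()` are hypothetical, the `mask` field is inferred from the discussion (the quoted diff only shows `addr` and `val`), and a plain 64-bit read of the monitored address is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct quoted in the thread. The `mask`
 * field is an assumption taken from the discussion, not from the diff. */
struct monitor_cond_sketch {
	volatile void *addr; /* address to monitor for writes */
	uint64_t val;        /* value meaning "the expected write already happened" */
	uint64_t mask;       /* bits of *addr to compare; 0 disables the check */
};

/*
 * Returns true when the caller should skip arming the monitor and sleeping,
 * because the masked current value already matches the expected value.
 * With mask == 0 no comparison is made and the caller simply waits for
 * any write to the monitored address.
 */
static bool
write_already_happened(const struct monitor_cond_sketch *cond)
{
	if (cond->mask == 0)
		return false;

	/* Assumption: the monitored location is read as a 64-bit word. */
	const uint64_t cur = *(const volatile uint64_t *)cond->addr;

	return (cur & cond->mask) == cond->val;
}
```

The point of the sketch is that the comparison is a guard evaluated once before sleeping. It does not turn the call into a "wait until equal" loop: after the guard passes, wake-up is driven by a write (or a timeout), whatever value is written.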
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-20 11:05 ` Burakov, Anatoly
@ 2021-01-20 11:11 ` Thomas Monjalon
  2021-01-20 11:17 ` Burakov, Anatoly
  0 siblings, 1 reply; 421+ messages in thread
From: Thomas Monjalon @ 2021-01-20 11:11 UTC
To: Burakov, Anatoly
Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara, david.marchand, jerinj, ajit.khaparde, honnappa.nagarahalli, David Christensen

20/01/2021 12:05, Burakov, Anatoly:
> On 20-Jan-21 10:38 AM, Thomas Monjalon wrote:
> > 20/01/2021 11:32, Burakov, Anatoly:
> >> On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
> >>> 19/01/2021 12:23, Burakov, Anatoly:
> >>>> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
> >>>>> 19/01/2021 11:29, Burakov, Anatoly:
> >>>>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
> >>>>>>> 14/01/2021 15:46, Anatoly Burakov:
> >>>>>>>> +struct rte_power_monitor_cond {
> >>>>>>>> +    volatile void *addr; /**< Address to monitor for changes */
> >>>>>>>> +    uint64_t val; /**< Before attempting the monitoring, the address
> >>>>>>>> +                   *   may be read and compared against this value.
> >>>>>>>
> >>>>>>> "may" be read and compared?
> >>>>>>> Is there a case where there is no read and compare?
> >>>>>>
> >>>>>> Yes, if the mask is not set.
> >>>>>
> >>>>> If the mask is not set, the address is "read" anyway
> >>>>> or it is only "watched" for any change?
> >>>>>
> >>>>> Sorry the mechanism is really not clear to me.
> >>>>>
> >>>>
> >>>> The "value" is only used to avoid the sleep, i.e. to check if the write
> >>>> has already happened. We're waiting on *a write* rather than *a value*,
> >>>> so it's not equivalent to "wait until equal" call. It's more of a "sleep
> >>>> until something happens".
> >>>
> >>> Please make things explicit in doxygen.
> >>> The behaviour of each case should be explained crystal clear.
> >>> Thanks
> >>
> >> It is explained in the comments to `rte_power_monitor()` call. But OK,
> >> i'll add more clarification for the struct too.
> >
> > Please avoid the word "may" in API description.
> >
> > This is what is explained in rte_power_monitor:
> > "
> >  * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> >  * mask is non-zero, the current value pointed to by the `p` pointer will be
> >  * checked against the expected value, and if they match, the entering of
> >  * optimized power state may be aborted.
> > "
> >
> > Can we replace "may" by "will"?
> >
>
> Yep, we can. However, the "may" part was intended to leave some wiggle
> room for a different implementation, should the need arise, and i find
> "will" to be needlessly prescriptive. Frankly, i do not see the need for
> such a detailed description of what the API does under the hood, as long
> as it's clear what its effects are. The main purpose is waiting for a
> write. The mask is only used to check whether the expected write has
> already happened by the time we're calling the API. Whether the CPU then
> does or does not go to sleep is not really relevant IMO.

I think it is relevant but I may be wrong.
Any other opinions?
* Re: [dpdk-dev] [PATCH v17 03/11] eal: change API of power intrinsics
  2021-01-20 11:11 ` Thomas Monjalon
@ 2021-01-20 11:17 ` Burakov, Anatoly
  0 siblings, 0 replies; 421+ messages in thread
From: Burakov, Anatoly @ 2021-01-20 11:17 UTC
To: Thomas Monjalon
Cc: dev, Timothy McDaniel, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Bruce Richardson, Konstantin Ananyev, david.hunt, chris.macnamara, david.marchand, ajit.khaparde, honnappa.nagarahalli

On 20-Jan-21 11:11 AM, Thomas Monjalon wrote:
> 20/01/2021 12:05, Burakov, Anatoly:
>> On 20-Jan-21 10:38 AM, Thomas Monjalon wrote:
>>> 20/01/2021 11:32, Burakov, Anatoly:
>>>> On 19-Jan-21 2:17 PM, Thomas Monjalon wrote:
>>>>> 19/01/2021 12:23, Burakov, Anatoly:
>>>>>> On 19-Jan-21 10:42 AM, Thomas Monjalon wrote:
>>>>>>> 19/01/2021 11:29, Burakov, Anatoly:
>>>>>>>> On 18-Jan-21 10:26 PM, Thomas Monjalon wrote:
>>>>>>>>> 14/01/2021 15:46, Anatoly Burakov:
>>>>>>>>>> +struct rte_power_monitor_cond {
>>>>>>>>>> +    volatile void *addr; /**< Address to monitor for changes */
>>>>>>>>>> +    uint64_t val; /**< Before attempting the monitoring, the address
>>>>>>>>>> +                   *   may be read and compared against this value.
>>>>>>>>>
>>>>>>>>> "may" be read and compared?
>>>>>>>>> Is there a case where there is no read and compare?
>>>>>>>>
>>>>>>>> Yes, if the mask is not set.
>>>>>>>
>>>>>>> If the mask is not set, the address is "read" anyway
>>>>>>> or it is only "watched" for any change?
>>>>>>>
>>>>>>> Sorry the mechanism is really not clear to me.
>>>>>>>
>>>>>>
>>>>>> The "value" is only used to avoid the sleep, i.e. to check if the write
>>>>>> has already happened. We're waiting on *a write* rather than *a value*,
>>>>>> so it's not equivalent to "wait until equal" call. It's more of a "sleep
>>>>>> until something happens".
>>>>>
>>>>> Please make things explicit in doxygen.
>>>>> The behaviour of each case should be explained crystal clear.
>>>>> Thanks
>>>>
>>>> It is explained in the comments to `rte_power_monitor()` call. But OK,
>>>> i'll add more clarification for the struct too.
>>>
>>> Please avoid the word "may" in API description.
>>>
>>> This is what is explained in rte_power_monitor:
>>> "
>>>  * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
>>>  * mask is non-zero, the current value pointed to by the `p` pointer will be
>>>  * checked against the expected value, and if they match, the entering of
>>>  * optimized power state may be aborted.
>>> "
>>>
>>> Can we replace "may" by "will"?
>>>
>>
>> Yep, we can. However, the "may" part was intended to leave some wiggle
>> room for a different implementation, should the need arise, and i find
>> "will" to be needlessly prescriptive. Frankly, i do not see the need for
>> such a detailed description of what the API does under the hood, as long
>> as it's clear what its effects are. The main purpose is waiting for a
>> write. The mask is only used to check whether the expected write has
>> already happened by the time we're calling the API. Whether the CPU then
>> does or does not go to sleep is not really relevant IMO.
>
> I think it is relevant but I may be wrong.
> Any other opinions?
>

I have no objection in documenting that further, so i'll go ahead and do
it :) It's just that i think it's not necessary to be *this* detailed
about how the API does things.

-- 
Thanks,
Anatoly
* [dpdk-dev] [PATCH v17 04/11] eal: remove sync version of power monitor
  2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 " Anatoly Burakov
  ` (2 preceding siblings ...)
  2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 03/11] eal: change API of " Anatoly Burakov
@ 2021-01-14 14:46 ` Anatoly Burakov
  2021-01-14 14:46 ` [dpdk-dev] [PATCH v17 05/11] eal: add monitor wakeup function Anatoly Burakov
  ` (8 subsequent siblings)
  12 siblings, 0 replies; 421+ messages in thread
From: Anatoly Burakov @ 2021-01-14 14:46 UTC
To: dev
Cc: Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev, thomas, timothy.mcdaniel, david.hunt, chris.macnamara

Currently, the "sync" version of power monitor intrinsic is supposed to
be used for purposes of waking up a sleeping core. However, there are
better ways to achieve the same result, so remove the unneeded function.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_eal/arm/rte_power_intrinsics.c     | 14 -----
 .../include/generic/rte_power_intrinsics.h    | 38 -------------
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 14 -----
 lib/librte_eal/version.map                    |  1 -
 lib/librte_eal/x86/rte_power_intrinsics.c     | 54 -------------------
 5 files changed, 121 deletions(-)