From mboxrd@z Thu Jan  1 00:00:00 1970
From: Anatoly Burakov
To: dev@dpdk.org
Cc: Liang Ma, David Hunt, Ray Kinsella, Neil Horman,
	konstantin.ananyev@intel.com, jerinjacobk@gmail.com,
	bruce.richardson@intel.com, thomas@monjalon.net,
	timothy.mcdaniel@intel.com, gage.eads@intel.com,
	chris.macnamara@intel.com
Date: Wed, 14 Oct 2020 14:30:18 +0100
X-Mailer: git-send-email 2.17.1
In-Reply-To: <532f45c5d79b4c30a919553d322bb66e91534466.1602258833.git.anatoly.burakov@intel.com>
References: <532f45c5d79b4c30a919553d322bb66e91534466.1602258833.git.anatoly.burakov@intel.com>
Subject: [dpdk-dev] [PATCH v6 05/10] power: add PMD power management API and callback

From: Liang Ma

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait either until
a TSC timestamp expires or until packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on
different cores).

This design uses PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting for the address of the
   next RX descriptor to be written to.

2. Pause instruction

   Instead of moving the core into a deeper C-state, this method uses
   the pause instruction to avoid busy polling.

3. Frequency scaling

   Reuse the existing DPDK power library to scale the core frequency
   up/down depending on traffic volume.
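To illustrate the intended use, here is a minimal, hypothetical sketch
(application-side code, not part of this patch; port/queue setup, the
polling loop and error handling are assumed to exist elsewhere):

    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_power_pmd_mgmt.h>

    /* enable PAUSE-based power management on one RX queue, polled by
     * the current lcore; assumes the port and queue are already
     * configured and started.
     */
    static int
    queue_power_mgmt_example(uint16_t port_id, uint16_t queue_id)
    {
            int ret;

            ret = rte_power_pmd_mgmt_queue_enable(rte_lcore_id(),
                            port_id, queue_id, RTE_POWER_MGMT_TYPE_PAUSE);
            if (ret < 0)
                    return ret;

            /* ... the rte_eth_rx_burst() loop runs as usual; the
             * installed RX callback counts empty polls and backs off
             * once the threshold is crossed ...
             */

            /* on teardown, remove the callback (not thread-safe, so
             * call it only once the polling thread is known to have
             * stopped or will be woken up)
             */
            return rte_power_pmd_mgmt_queue_disable(rte_lcore_id(),
                            port_id, queue_id);
    }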
Signed-off-by: Liang Ma
Signed-off-by: Anatoly Burakov
---

Notes:
    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
    - Prevent initializing scaling if ACPI or PSTATE env wasn't set
    - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path with a generic path using the intrinsics check

 doc/guides/rel_notes/release_20_11.rst |  11 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 300 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 ++++++++
 lib/librte_power/rte_power_version.map |   4 +
 5 files changed, 410 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index ca4f43f7f9..06b822aa36 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -197,6 +197,17 @@ New Features
   * Added new ``RTE_ACL_CLASSIFY_AVX512X32`` vector implementation,
     which can process up to 32 flows in parallel. Requires AVX512 support.
 
+* **Add PMD power management mechanism**
+
+  Three new Ethernet PMD power management mechanisms are added through the
+  existing RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..2b7d2a1a46
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,300 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_spinlock.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX 512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static void
+umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			uint16_t ret;
+			uint8_t data_sz;
+
+			/*
+			 * get the address of the next descriptor in the
+			 * RX ring for this queue, as well as the expected
+			 * value and a mask.
+			 */
+			ret = rte_eth_get_wake_addr(port_id, qidx,
+					&target_addr, &expected, &mask,
+					&data_sz);
+			if (ret == 0) {
+				/*
+				 * we need to be able to be woken up by
+				 * another thread triggering a write, so
+				 * the wakeup address must always be kept
+				 * up to date.
+				 */
+				rte_spinlock_lock(&q_conf->umwait_lock);
+				q_conf->wait_addr = target_addr;
+				/* -1ULL is the maximum value for TSC */
+				rte_power_monitor_sync(target_addr, expected,
+						mask, -1ULL, data_sz,
+						&q_conf->umwait_lock);
+				/* erase the address */
+				q_conf->wait_addr = NULL;
+				rte_spinlock_unlock(&q_conf->umwait_lock);
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue ID is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Set up a per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore ID of the core that polls this queue.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id,
+		uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove the per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore ID of the core that polls this queue.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id,
+		uint16_t queue_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1
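
For completeness, a usage sketch of the SCALE mode (hypothetical
application code; the lcore-to-queue assignment is an assumption, and the
enable calls must be made before the data plane threads start polling,
since the API above is documented as not thread-safe):

    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_power_pmd_mgmt.h>

    /* one RX queue per worker lcore, as mandated by the API's
     * core-to-single-queue mapping; frequency scaling then acts on
     * the lcore that polls each queue.
     */
    static int
    enable_scale_mode(uint16_t port_id, uint16_t nb_rx_queues)
    {
            unsigned int lcore_id;
            uint16_t qid = 0;
            int ret;

            RTE_LCORE_FOREACH_WORKER(lcore_id) {
                    if (qid == nb_rx_queues)
                            break;
                    ret = rte_power_pmd_mgmt_queue_enable(lcore_id,
                                    port_id, qid++,
                                    RTE_POWER_MGMT_TYPE_SCALE);
                    if (ret < 0)
                            return ret;
            }
            return 0;
    }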