From: Liang Ma <liang.j.ma@intel.com>
To: dev@dpdk.org
Cc: ruifeng.wang@arm.com, haiyue.wang@intel.com, bruce.richardson@intel.com,
 konstantin.ananyev@intel.com, david.hunt@intel.com, jerinjacobk@gmail.com,
 nhorman@tuxdriver.com, thomas@monjalon.net, timothy.mcdaniel@intel.com,
 gage.eads@intel.com, mw@semihalf.com, gtzalik@amazon.com,
 ajit.khaparde@broadcom.com, hkalra@marvell.com, johndale@cisco.com,
 matan@nvidia.com, yongwang@vmware.com,
 Anatoly Burakov <anatoly.burakov@intel.com>, Ray Kinsella
Date: Mon, 2 Nov 2020 11:10:01 +0000
Message-Id: <1604315406-27669-3-git-send-email-liang.j.ma@intel.com>
X-Mailer: git-send-email 1.7.7.4
In-Reply-To: <1604315406-27669-1-git-send-email-liang.j.ma@intel.com>
References: <2772eb151ccba5cc17186e6161d8834176924753.1590598121.git.anatoly.burakov@intel.com>
 <1604315406-27669-1-git-send-email-liang.j.ma@intel.com>
Subject: [dpdk-dev] [PATCH v11 2/6] power: add PMD power management API and callback

Add a simple on/off switch that enables power saving when no packets
are arriving. It is based on counting the number of empty polls and,
when the number reaches a certain threshold, entering an
architecture-defined optimized power state that waits either until a
TSC timestamp expires or until packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on
different cores).

This design uses PMD RX callbacks. The following schemes are available:
1. UMWAIT/UMONITOR

   When a certain threshold of empty polls is reached, the core will go
   into a power optimized sleep while waiting on the address of the next
   RX descriptor to be written to.

2. Pause instruction

   Instead of moving the core into a deeper C-state, this method uses
   the PAUSE instruction to avoid busy polling.

3. Frequency scaling

   Reuses the existing DPDK power library to scale the core frequency
   up/down depending on traffic volume.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v11:
    - Updated power library document
    - Updated release notes

    v10:
    - Updated power library document

    v8:
    - Rename version map file name

    v7:
    - Fixed race condition (Konstantin)
    - Slight rework of the structure of monitor code
    - Added missing inline for wakeup

    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch

    v5:
    - Make error checking more robust
    - Prevent initializing scaling if ACPI or PSTATE env wasn't set
    - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check

 doc/guides/prog_guide/power_man.rst    |  51 ++++
 doc/guides/rel_notes/release_20_11.rst |  15 +-
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 320 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 +++++++
 lib/librte_power/version.map           |   4 +
 6 files changed, 484 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..380e7aace7 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,54 @@ User Cases
 ----------
 
 The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
 
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+
+Existing power management mechanisms require developers to change the design
+of an application, or change its code, to make use of them. The PMD Power
+Management API provides a convenient alternative by using Ethernet PMD RX
+callbacks and triggering power saving whenever the empty poll count reaches
+a certain threshold.
+
+There are multiple power saving schemes available for the developer to choose
+from. Although each queue can be configured with a different scheme, it is
+strongly recommended to configure all queues within the same port with the
+same scheme.
+
+The following are the available schemes:
+
+* UMWAIT/UMONITOR
+
+  This power saving scheme will put the core into an optimized power state
+  and use the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX
+  descriptor address, waking the CPU up whenever there is new traffic.
+
+* Pause
+
+  This power saving scheme will use the ``rte_pause()`` function to reduce
+  the impact of busy polling.
+
+* Frequency scaling
+
+  This power saving scheme will use existing power library functionality to
+  scale the core frequency up/down depending on traffic volume.
+
+.. note::
+
+   Currently, this power management API is limited to a mapping of 1 queue
+   to 1 core (multiple queues are supported, but they must be polled from
+   different cores).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Queue Enable**: Enable a specific power scheme for a certain queue/port/core.
+
+* **Queue Disable**: Disable the power scheme for a certain queue/port/core.
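+
+Below is a minimal usage sketch for the frequency scaling scheme. It is
+illustrative only: the port/queue values are example assumptions, and error
+handling is trimmed.
+
+.. code-block:: c
+
+   uint16_t port_id = 0, queue_id = 0; /* example values */
+
+   /* enable the scheme for the queue polled by this lcore */
+   if (rte_power_pmd_mgmt_queue_enable(rte_lcore_id(), port_id, queue_id,
+                   RTE_POWER_MGMT_TYPE_SCALE) < 0)
+       rte_exit(EXIT_FAILURE, "Cannot enable PMD power management\n");
+
+   /*
+    * ... the rte_eth_rx_burst() loop runs as usual; the RX callback
+    * installed by the API manages the power state transparently ...
+    */
+
+   /* remove the callback before stopping the port */
+   rte_power_pmd_mgmt_queue_disable(rte_lcore_id(), port_id, queue_id);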
+
 References
 ----------
@@ -200,3 +248,6 @@ References
 
 * The :doc:`../sample_app_ug/vm_power_management`
   chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../sample_app_ug/rxtx_callbacks`
+  chapter in the :doc:`../sample_app_ug/index` section.
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index e95e6aa7a5..0430fca9cc 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -150,7 +150,9 @@ New Features
 
 * **ethdev: added 1 new API for PMD power management.**
 
-  * ``rte_eth_get_wake_addr()`` is added to get the wake up address from device.
+  * ``rte_eth_get_wake_addr()`` function has been added to allow applications
+    to fetch the wake-up information from the device. The processor needs
+    that information to wake up from a low power state.
 
 * **Updated Broadcom bnxt driver.**
 
@@ -362,6 +364,17 @@ New Features
 
   * Replaced ``--scalar`` command-line option with ``--alg=``, to allow
    the user to select the desired classify method.
 
+* **Added PMD power management mechanism.**
+
+  New Ethernet PMD power management mechanisms have been added through the
+  existing RX callback infrastructure.
+
+  * Added power saving scheme based on UMWAIT instruction (x86 only)
+  * Added power saving scheme based on ``rte_pause()``
+  * Added power saving scheme based on frequency scaling through the power library
+  * Added new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Added new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
		'power_kvm_vm.c', 'guest_channel.c',
		'rte_power_empty_poll.c',
		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..0dcaddc3bd
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,320 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_spinlock.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX 512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/*
+ * Trigger a write to the cache line we're waiting on. A compare-exchange
+ * that stores back the value it just read still counts as a write to the
+ * monitored line, which is enough to wake the core from UMWAIT without
+ * changing the data the PMD cares about.
+ */
+static inline void
+umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
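+/*
+ * Sleep until the monitored address is written to, or until the TSC
+ * timestamp expires. The per-queue spinlock makes the "re-check state,
+ * publish wakeup address, arm the monitor" sequence atomic with respect to
+ * the control plane: rte_power_monitor_sync() releases the lock for the
+ * duration of the sleep and reacquires it before returning, so the disable
+ * path can take the same lock, write to the published address (see
+ * umwait_wakeup() above) and flip the state to PMD_MGMT_DISABLED without
+ * racing against this code.
+ */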
+static inline void
+umwait_sleep(struct pmd_queue_cfg *q_conf, uint16_t port_id, uint16_t qidx)
+{
+	volatile void *target_addr;
+	uint64_t expected, mask;
+	uint8_t data_sz;
+	uint16_t ret;
+
+	/*
+	 * get wake up address for this RX queue, as well as expected value,
+	 * comparison mask, and data size.
+	 */
+	ret = rte_eth_get_wake_addr(port_id, qidx, &target_addr,
+			&expected, &mask, &data_sz);
+
+	/* this should always succeed as all checks have been done already */
+	if (unlikely(ret != 0))
+		return;
+
+	/*
+	 * take out a spinlock to prevent control plane from concurrently
+	 * modifying the wakeup data.
+	 */
+	rte_spinlock_lock(&q_conf->umwait_lock);
+
+	/* have we been disabled by control plane? */
+	if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		/* we're good to go */
+
+		/*
+		 * store the wakeup address so that control plane can trigger
+		 * a write to this address and wake us up.
+		 */
+		q_conf->wait_addr = target_addr;
+		/* -1ULL is maximum value for TSC */
+		rte_power_monitor_sync(target_addr, expected, mask, -1ULL,
+				data_sz, &q_conf->umwait_lock);
+		/* erase the address */
+		q_conf->wait_addr = NULL;
+	}
+	rte_spinlock_unlock(&q_conf->umwait_lock);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			umwait_sleep(q_conf, port_id, qidx);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	default:
+		/* reject unknown power management modes */
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+		/*
+		 * we need to disable early as there might be a callback
+		 * currently spinning on the lock.
+		 */
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Set up a per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore ID of the thread that polls this queue.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Remove the per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore ID of the thread that polls this queue.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
	rte_power_guest_channel_receive_msg;
	rte_power_poll_stat_fetch;
	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };
-- 
2.17.1