From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by dpdk.org (Postfix) with ESMTP id ECFDC2BC7 for ; Tue, 2 Oct 2018 12:13:23 +0200 (CEST) X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga106.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 02 Oct 2018 03:13:22 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.54,331,1534834800"; d="scan'208";a="88464380" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga003.jf.intel.com with ESMTP; 02 Oct 2018 03:13:20 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w92ADKpK013046; Tue, 2 Oct 2018 11:13:20 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w92ADK4c001776; Tue, 2 Oct 2018 11:13:20 +0100 Received: (from lma25@localhost) by sivswdev01.ir.intel.com with LOCAL id w92ADJIo001772; Tue, 2 Oct 2018 11:13:19 +0100 Date: Tue, 2 Oct 2018 11:13:19 +0100 From: "Liang, Ma" To: "Hunt, David" Cc: dev@dpdk.org, lei.a.yao@intel.com, ktraynor@redhat.com, john.geary@intel.com Message-ID: <20181002101319.GA12694@sivswdev01.ir.intel.com> References: <1536070228-6545-1-git-send-email-liang.j.ma@intel.com> <1537191016-26330-1-git-send-email-liang.j.ma@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.4.2.3i Subject: Re: [dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 02 Oct 2018 10:13:24 -0000 Hi Dave, Please check comment below. On 28 Sep 11:47, Hunt, David wrote: > Hi Liang, > > > On 17/9/2018 2:30 PM, Liang Ma wrote: > >1. Abstract > > > >For packet processing workloads such as DPDK polling is continuous. > >This means CPU cores always show 100% busy independent of how much work > >those cores are doing. It is critical to accurately determine how busy > >a core is hugely important for the following reasons: > > > > * No indication of overload conditions > > > > * User do not know how much real load is on a system meaning resulted > > in > > wasted energy as no power management is utilized > > > >Compared to the original l3fwd-power design, instead of going to sleep > >after detecting an empty poll, the new mechanism just lowers the core > >frequency. As a result, the application does not stop polling the device, > >which leads to improved handling of bursts of traffic. > > > >When the system become busy, the empty poll mechanism can also increase the > >core frequency (including turbo) to do best effort for intensive traffic. > >This gives us more flexible and balanced traffic awareness over the > >standard l3fwd-power application. > > > >2. Proposed solution > > > >The proposed solution focuses on how many times empty polls are executed. > >The less the number of empty polls, means current core is busy with > >processing workload, therefore, the higher frequency is needed. The high > >empty poll number indicates the current core not doing any real work > >therefore, we can lower the frequency to safe power. > > > >In the current implementation, each core has 1 empty-poll counter which > >assume 1 core is dedicated to 1 queue. This will need to be expanded in the > >future to support multiple queues per core. > > > >2.1 Power state definition: > > > > LOW: Not currently used, reserved for future use. > > > > MED: the frequency is used to process modest traffic workload. > > > > HIGH: the frequency is used to process busy traffic workload. > > > >2.2 There are two phases to establish the power management system: > > > > a.Initialization/Training phase. The training phase is necessary > > in order to figure out the system polling baseline numbers from > > idle to busy. The highest poll count will be during idle, where > > all polls are empty. These poll counts will be different between > > systems due to the many possible processor micro-arch, cache > > and device configurations, hence the training phase. > > In the training phase, traffic is blocked so the training > > algorithm can average the empty-poll numbers for the LOW, MED and > > HIGH power states in order to create a baseline. > > The core's counter are collected every 10ms, and the Training > > phase will take 2 seconds. > > Training is disabled as default configuration. the default > > parameter is applied. Simple App still can trigger training > > Typo: "Simple" should be "Sample" > > Suggest adding: Once the training phase has been executed once on a > system, the application > can then be started with the relevant thresholds provided on the command > line, allowing the > application to start passing start traffic immediately. agree > > > if that's needed. > > > > b.Normal phase. When the training phase is complete, traffic is > > started. The run-time poll counts are compared with the > > baseline and the decision will be taken to move to MED power > > state or HIGH power state. The counters are calculated every 10ms. > > Propose changing the first sentence:  Traffic starts immediately based > on the default > thresholds, or based on the user supplied thresholds via the command > line parameters. > agree > > > > >3. Proposed API > > > >1. rte_power_empty_poll_stat_init(struct ep_params **eptr, > > uint8_t *freq_tlb, struct ep_policy *policy); > >which is used to initialize the power management system. > > > >2. rte_power_empty_poll_stat_free(void); > >which is used to free the resource hold by power management system. > > > >3. rte_power_empty_poll_stat_update(unsigned int lcore_id); > >which is used to update specific core empty poll counter, not thread safe > > > >4. rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt); > >which is used to update specific core valid poll counter, not thread safe > > > >5. rte_power_empty_poll_stat_fetch(unsigned int lcore_id); > >which is used to get specific core empty poll counter. > > > >6. rte_power_poll_stat_fetch(unsigned int lcore_id); > >which is used to get specific core valid poll counter. > > > >7. rte_empty_poll_detection(struct rte_timer *tim, void *arg); > >which is used to detect empty poll state changes then take action. > > > >ChangeLog: > >v2: fix some coding style issues. > >v3: rename the filename, API name. > >v4: no change. > >v5: no change. > >v6: re-work the code layout, update API. > >v7: fix minor typo and lift node num limit. > >v8: disable training as default option. > > > >Signed-off-by: Liang Ma > > > >Reviewed-by: Lei Yao > >--- > > lib/librte_power/Makefile | 6 +- > > lib/librte_power/meson.build | 5 +- > > lib/librte_power/rte_power_empty_poll.c | 539 > > ++++++++++++++++++++++++++++++++ > > lib/librte_power/rte_power_empty_poll.h | 219 +++++++++++++ > > lib/librte_power/rte_power_version.map | 13 + > > 5 files changed, 778 insertions(+), 4 deletions(-) > > create mode 100644 lib/librte_power/rte_power_empty_poll.c > > create mode 100644 lib/librte_power/rte_power_empty_poll.h > > > >diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile > >index 6f85e88..a8f1301 100644 > >--- a/lib/librte_power/Makefile > >+++ b/lib/librte_power/Makefile > >@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk > > # library name > > LIB = librte_power.a > > > >+CFLAGS += -DALLOW_EXPERIMENTAL_API > > CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing > >-LDLIBS += -lrte_eal > >+LDLIBS += -lrte_eal -lrte_timer > > > > EXPORT_MAP := rte_power_version.map > > > >@@ -16,8 +17,9 @@ LIBABIVER := 1 > > # all source are stored in SRCS-y > > SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c > > SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c > >+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c > > > > # install this header file > >-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h > >+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h > >rte_power_empty_poll.h > > > > include $(RTE_SDK)/mk/rte.lib.mk > >diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build > >index 253173f..63957eb 100644 > >--- a/lib/librte_power/meson.build > >+++ b/lib/librte_power/meson.build > >@@ -5,5 +5,6 @@ if host_machine.system() != 'linux' > > build = false > > endif > > sources = files('rte_power.c', 'power_acpi_cpufreq.c', > >- 'power_kvm_vm.c', 'guest_channel.c') > >-headers = files('rte_power.h') > >+ 'power_kvm_vm.c', 'guest_channel.c', > >+ 'rte_power_empty_poll.c') > >+headers = files('rte_power.h','rte_power_empty_poll.h') > >diff --git a/lib/librte_power/rte_power_empty_poll.c > >b/lib/librte_power/rte_power_empty_poll.c > >new file mode 100644 > >index 0000000..587ce78 > >--- /dev/null > >+++ b/lib/librte_power/rte_power_empty_poll.c > >@@ -0,0 +1,539 @@ > >+/* SPDX-License-Identifier: BSD-3-Clause > >+ * Copyright(c) 2010-2018 Intel Corporation > >+ */ > >+ > >+#include > >+ > >+#include > >+#include > >+#include > >+#include > >+ > >+#include "rte_power.h" > >+#include "rte_power_empty_poll.h" > >+ > >+#define INTERVALS_PER_SECOND 100 /* (10ms) */ > >+#define SECONDS_TO_TRAIN_FOR 2 > >+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70 > >+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30 > >+#define DEFAULT_CYCLES_PER_PACKET 800 > >+ > >+static struct ep_params *ep_params; > >+static uint32_t med_to_high_threshold = > >DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD; > >+static uint32_t high_to_med_threshold = > >DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD; > >+ > >+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS]; > >+ > >+static uint32_t total_avail_freqs[RTE_MAX_LCORE]; > >+ > >+static uint32_t freq_index[NUM_FREQ]; > >+ > >+static uint32_t > >+get_freq_index(enum freq_val index) > >+{ > >+ return freq_index[index]; > >+} > >+ > >+ > >+static int > >+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq) > >+{ > >+ int err = 0; > >+ uint32_t power_freq_index; > >+ if (!specific_freq) > >+ power_freq_index = get_freq_index(freq); > >+ else > >+ power_freq_index = freq; > >+ > >+ err = rte_power_set_freq(lcore_id, power_freq_index); > >+ > >+ return err; > >+} > >+ > >+ > >+static inline void __attribute__((always_inline)) > >+exit_training_state(struct priority_worker *poll_stats) > >+{ > >+ RTE_SET_USED(poll_stats); > >+} > >+ > > Is this really needed? It does nothing, and is just local to this file. > this is needed for debug purpose, I prefer keep it. > > >+static inline void __attribute__((always_inline)) > >+enter_training_state(struct priority_worker *poll_stats) > >+{ > >+ poll_stats->iter_counter = 0; > >+ poll_stats->cur_freq = LOW; > >+ poll_stats->queue_state = TRAINING; > >+} > >+ > >+static inline void __attribute__((always_inline)) > >+enter_normal_state(struct priority_worker *poll_stats) > >+{ > >+ /* Clear the averages arrays and strs */ > >+ memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av)); > >+ poll_stats->ec = 0; > >+ memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av)); > >+ poll_stats->pc = 0; > >+ > >+ poll_stats->cur_freq = MED; > >+ poll_stats->iter_counter = 0; > >+ poll_stats->threshold_ctr = 0; > >+ poll_stats->queue_state = MED_NORMAL; > >+ RTE_LOG(INFO, POWER, "set the poewr freq to MED\n"); > > Typo, "poewr" should be "power", also suggest "Set" rather than "set" > agree > > >+ set_power_freq(poll_stats->lcore_id, MED, false); > >+ > >+ /* Try here */ > > Not sure about this comment? will be removed > > >+ poll_stats->thresh[MED].threshold_percent = med_to_high_threshold; > >+ poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold; > >+} > >+ > >+static inline void __attribute__((always_inline)) > >+enter_busy_state(struct priority_worker *poll_stats) > >+{ > >+ memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av)); > >+ poll_stats->ec = 0; > >+ memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av)); > >+ poll_stats->pc = 0; > >+ > >+ poll_stats->cur_freq = HGH; > >+ poll_stats->iter_counter = 0; > >+ poll_stats->threshold_ctr = 0; > >+ poll_stats->queue_state = HGH_BUSY; > >+ set_power_freq(poll_stats->lcore_id, HGH, false); > >+} > >+ > >+static inline void __attribute__((always_inline)) > >+enter_purge_state(struct priority_worker *poll_stats) > >+{ > >+ poll_stats->iter_counter = 0; > >+ poll_stats->queue_state = LOW_PURGE; > >+} > >+ > >+static inline void __attribute__((always_inline)) > >+set_state(struct priority_worker *poll_stats, > >+ enum queue_state new_state) > >+{ > >+ enum queue_state old_state = poll_stats->queue_state; > >+ if (old_state != new_state) { > >+ > >+ /* Call any old state exit functions */ > >+ if (old_state == TRAINING) > >+ exit_training_state(poll_stats); > > Is this needed? exit_training_state() does nothing. > original code is used for debug purpose. we can leave it. > >+ /* Call any new state entry functions */ > >+ if (new_state == TRAINING) > >+ enter_training_state(poll_stats); > >+ if (new_state == MED_NORMAL) > >+ enter_normal_state(poll_stats); > >+ if (new_state == HGH_BUSY) > >+ enter_busy_state(poll_stats); > >+ if (new_state == LOW_PURGE) > >+ enter_purge_state(poll_stats); > >+ } > >+} > >+ > >+static inline void __attribute__((always_inline)) > >+set_policy(struct priority_worker *poll_stats, > >+ struct ep_policy *policy) > >+{ > >+ set_state(poll_stats, policy->state); > >+ > >+ if (policy->state == TRAINING) > >+ return; > >+ > >+ poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi; > >+ poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi; > >+ > >+ poll_stats->thresh[MED_NORMAL].trained = true; > >+ poll_stats->thresh[HGH_BUSY].trained = true; > >+ > >+} > >+ > >+static void > >+update_training_stats(struct priority_worker *poll_stats, > >+ uint32_t freq, > >+ bool specific_freq, > >+ uint32_t max_train_iter) > >+{ > >+ RTE_SET_USED(specific_freq); > >+ > >+ char pfi_str[32]; > >+ uint64_t p0_empty_deq; > >+ > >+ sprintf(pfi_str, "%02d", freq); > >+ > >+ if (poll_stats->cur_freq == freq && > >+ poll_stats->thresh[freq].trained == false) { > >+ if (poll_stats->thresh[freq].cur_train_iter == 0) { > >+ > >+ set_power_freq(poll_stats->lcore_id, > >+ freq, specific_freq); > >+ > >+ poll_stats->empty_dequeues_prev = > >+ poll_stats->empty_dequeues; > >+ > >+ poll_stats->thresh[freq].cur_train_iter++; > >+ > >+ return; > >+ } else if (poll_stats->thresh[freq].cur_train_iter > >+ <= max_train_iter) { > >+ > >+ p0_empty_deq = poll_stats->empty_dequeues - > >+ poll_stats->empty_dequeues_prev; > >+ > >+ poll_stats->empty_dequeues_prev = > >+ poll_stats->empty_dequeues; > >+ > >+ poll_stats->thresh[freq].base_edpi += p0_empty_deq; > >+ poll_stats->thresh[freq].cur_train_iter++; > >+ > >+ } else { > >+ if (poll_stats->thresh[freq].trained == false) { > >+ poll_stats->thresh[freq].base_edpi = > >+ poll_stats->thresh[freq].base_edpi / > >+ max_train_iter; > >+ > >+ /* Add on a factor of 0.05%*/ > >+ /* this should remove any */ > >+ /* false negatives when the system is 0% > >busy */ > > Multi line comment should follow the usual standard /* \n * text \n text > \n */ > agree > >+ poll_stats->thresh[freq].base_edpi += > >+ poll_stats->thresh[freq].base_edpi / > >2000; > >+ > >+ poll_stats->thresh[freq].trained = true; > >+ poll_stats->cur_freq++; > >+ > >+ } > >+ } > >+ } > >+} > >+ > >+static inline uint32_t __attribute__((always_inline)) > >+update_stats(struct priority_worker *poll_stats) > >+{ > >+ uint64_t tot_edpi = 0, tot_ppi = 0; > >+ uint32_t j, percent; > >+ > >+ struct priority_worker *s = poll_stats; > >+ > >+ uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev; > >+ > >+ s->empty_dequeues_prev = s->empty_dequeues; > >+ > >+ uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev; > >+ > >+ s->num_dequeue_pkts_prev = s->num_dequeue_pkts; > >+ > >+ if (s->thresh[s->cur_freq].base_edpi < cur_edpi) { > >+ RTE_LOG(DEBUG, POWER, "cur_edpi is too large " > >+ "cur edpi %ld " > >+ "base epdi %ld\n", > >+ cur_edpi, > >+ s->thresh[s->cur_freq].base_edpi); > > Suggest making this log message more meaningful to the user. I suspect > that "cur_edpi" will not mean much to the user. > What does edpi mean? > agree > >+ /* Value to make us fail need debug log*/ > >+ return 1000UL; > >+ } > >+ > >+ s->edpi_av[s->ec++ % BINS_AV] = cur_edpi; > >+ s->ppi_av[s->pc++ % BINS_AV] = ppi; > >+ > >+ for (j = 0; j < BINS_AV; j++) { > >+ tot_edpi += s->edpi_av[j]; > >+ tot_ppi += s->ppi_av[j]; > >+ } > >+ > >+ tot_edpi = tot_edpi / BINS_AV; > >+ > >+ percent = 100 - (uint32_t)(((float)tot_edpi / > >+ (float)s->thresh[s->cur_freq].base_edpi) * 100); > >+ > >+ return (uint32_t)percent; > >+} > >+ > >+ > >+static inline void __attribute__((always_inline)) > >+update_stats_normal(struct priority_worker *poll_stats) > >+{ > >+ uint32_t percent; > >+ > >+ if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) { > >+ > >+ RTE_LOG(DEBUG, POWER, "cure freq is %d, edpi is %lu\n", > >+ poll_stats->cur_freq, > >+ poll_stats->thresh[poll_stats->cur_freq].base_edpi); > > Again, a more meaningful explanation of edpi is needed here. > agree > >+ return; > >+ } > >+ > >+ percent = update_stats(poll_stats); > >+ > >+ if (percent > 100) { > >+ RTE_LOG(DEBUG, POWER, "Big than 100 abnormal\n"); > > Please change to something meaningful to the user. What is the > percentage returned from update_stats()? > agree > >+ return; > >+ } > >+ > >+ if (poll_stats->cur_freq == LOW) > >+ RTE_LOG(INFO, POWER, "Purge Mode is not supported\n"); > > Suggest adding "currently" - "Purge Mode is not currently supported\n" > agree > >+ else if (poll_stats->cur_freq == MED) { > >+ > >+ if (percent > > >+ poll_stats->thresh[MED].threshold_percent) { > >+ > >+ if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND) > >+ poll_stats->threshold_ctr++; > >+ else { > >+ set_state(poll_stats, HGH_BUSY); > >+ RTE_LOG(INFO, POWER, "MOVE to HGH\n"); > >+ } > >+ > >+ } else { > >+ /* reset */ > >+ poll_stats->threshold_ctr = 0; > >+ } > >+ > >+ } else if (poll_stats->cur_freq == HGH) { > >+ > >+ if (percent < > >+ poll_stats->thresh[HGH].threshold_percent) { > >+ > >+ if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND) > >+ poll_stats->threshold_ctr++; > >+ else { > >+ set_state(poll_stats, MED_NORMAL); > >+ RTE_LOG(INFO, POWER, "MOVE to MED\n"); > >+ } > >+ } else { > >+ /* reset */ > >+ poll_stats->threshold_ctr = 0; > >+ } > >+ > >+ } > >+} > >+ > >+static int > >+empty_poll_training(struct priority_worker *poll_stats, > >+ uint32_t max_train_iter) > >+{ > >+ > >+ if (poll_stats->iter_counter < INTERVALS_PER_SECOND) { > >+ poll_stats->iter_counter++; > >+ return 0; > >+ } > >+ > >+ > >+ update_training_stats(poll_stats, > >+ LOW, > >+ false, > >+ max_train_iter); > >+ > >+ update_training_stats(poll_stats, > >+ MED, > >+ false, > >+ max_train_iter); > >+ > >+ update_training_stats(poll_stats, > >+ HGH, > >+ false, > >+ max_train_iter); > >+ > >+ > >+ if (poll_stats->thresh[LOW].trained == true > >+ && poll_stats->thresh[MED].trained == true > >+ && poll_stats->thresh[HGH].trained == true) { > >+ > >+ set_state(poll_stats, MED_NORMAL); > >+ > >+ RTE_LOG(INFO, POWER, "Low threshold is %lu\n", > >+ poll_stats->thresh[LOW].base_edpi); > > Suggest "Low" change to "LOW" for consistency with other log messages below. > agree > >+ > >+ RTE_LOG(INFO, POWER, "MED threshold is %lu\n", > >+ poll_stats->thresh[MED].base_edpi); > >+ > >+ > >+ RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n", > >+ poll_stats->thresh[HGH].base_edpi); > >+ > >+ RTE_LOG(INFO, POWER, "Training is Complete for %d\n", > >+ poll_stats->lcore_id); > >+ } > >+ > >+ return 0; > >+} > >+ > >+void > >+rte_empty_poll_detection(struct rte_timer *tim, void *arg) > >+{ > >+ > >+ uint32_t i; > >+ > >+ struct priority_worker *poll_stats; > >+ > >+ RTE_SET_USED(tim); > >+ > >+ RTE_SET_USED(arg); > >+ > >+ for (i = 0; i < NUM_NODES; i++) { > >+ > >+ poll_stats = &(ep_params->wrk_data.wrk_stats[i]); > >+ > >+ if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0) > >+ continue; > >+ > >+ switch (poll_stats->queue_state) { > >+ case(TRAINING): > >+ empty_poll_training(poll_stats, > >+ ep_params->max_train_iter); > >+ break; > >+ > >+ case(HGH_BUSY): > >+ case(MED_NORMAL): > >+ update_stats_normal(poll_stats); > >+ break; > >+ > >+ case(LOW_PURGE): > >+ break; > >+ default: > >+ break; > >+ > >+ } > >+ > >+ } > >+ > >+} > >+ > >+int __rte_experimental > >+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb, > >+ struct ep_policy *policy) > >+{ > >+ uint32_t i; > >+ /* Allocate the ep_params structure */ > >+ ep_params = rte_zmalloc_socket(NULL, > >+ sizeof(struct ep_params), > >+ 0, > >+ rte_socket_id()); > >+ > >+ if (!ep_params) > >+ rte_panic("Cannot allocate heap memory for ep_params " > >+ "for socket %d\n", rte_socket_id()); > >+ > >+ if (freq_tlb == NULL) { > >+ freq_index[LOW] = 14; > >+ freq_index[MED] = 9; > >+ freq_index[HGH] = 1; > >+ } else { > >+ freq_index[LOW] = freq_tlb[LOW]; > >+ freq_index[MED] = freq_tlb[MED]; > >+ freq_index[HGH] = freq_tlb[HGH]; > >+ } > >+ > >+ RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n"); > >+ > >+ /* 5 seconds worth of training */ > > This now looks to be 2 seconds from the #define above. Maybe: /* Train > for pre-defined period */ > agree > >+ ep_params->max_train_iter = INTERVALS_PER_SECOND * > >SECONDS_TO_TRAIN_FOR; > >+ > >+ struct stats_data *w = &ep_params->wrk_data; > >+ > >+ *eptr = ep_params; > >+ > >+ /* initialize all wrk_stats state */ > >+ for (i = 0; i < NUM_NODES; i++) { > >+ > >+ if (rte_lcore_is_enabled(i) == 0) > >+ continue; > >+ /*init the freqs table */ > >+ total_avail_freqs[i] = rte_power_freqs(i, > >+ avail_freqs[i], > >+ NUM_FREQS); > >+ > >+ RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n", > >+ total_avail_freqs[i], > >+ i); > >+ > >+ if (get_freq_index(LOW) > total_avail_freqs[i]) > >+ return -1; > >+ > >+ if (rte_get_master_lcore() != i) { > >+ w->wrk_stats[i].lcore_id = i; > >+ set_policy(&w->wrk_stats[i], policy); > >+ } > >+ } > >+ > >+ return 0; > >+} > >+ > >+void __rte_experimental > >+rte_power_empty_poll_stat_free(void) > >+{ > >+ > >+ RTE_LOG(INFO, POWER, "Close the Empty Poll\n"); > >+ > >+ if (ep_params != NULL) > >+ rte_free(ep_params); > >+} > >+ > >+int __rte_experimental > >+rte_power_empty_poll_stat_update(unsigned int lcore_id) > >+{ > >+ struct priority_worker *poll_stats; > >+ > >+ if (lcore_id >= NUM_NODES) > >+ return -1; > >+ > >+ poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]); > >+ > >+ if (poll_stats->lcore_id == 0) > >+ poll_stats->lcore_id = lcore_id; > >+ > >+ poll_stats->empty_dequeues++; > >+ > >+ return 0; > >+} > >+ > >+int __rte_experimental > >+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt) > >+{ > >+ > >+ struct priority_worker *poll_stats; > >+ > >+ if (lcore_id >= NUM_NODES) > >+ return -1; > >+ > >+ poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]); > >+ > >+ if (poll_stats->lcore_id == 0) > >+ poll_stats->lcore_id = lcore_id; > >+ > >+ poll_stats->num_dequeue_pkts += nb_pkt; > >+ > >+ return 0; > >+} > >+ > >+ > >+uint64_t __rte_experimental > >+rte_power_empty_poll_stat_fetch(unsigned int lcore_id) > >+{ > >+ struct priority_worker *poll_stats; > >+ > >+ if (lcore_id >= NUM_NODES) > >+ return -1; > >+ > >+ poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]); > >+ > >+ if (poll_stats->lcore_id == 0) > >+ poll_stats->lcore_id = lcore_id; > >+ > >+ return poll_stats->empty_dequeues; > >+} > >+ > >+uint64_t __rte_experimental > >+rte_power_poll_stat_fetch(unsigned int lcore_id) > >+{ > >+ struct priority_worker *poll_stats; > >+ > >+ if (lcore_id >= NUM_NODES) > >+ return -1; > >+ > >+ poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]); > >+ > >+ if (poll_stats->lcore_id == 0) > >+ poll_stats->lcore_id = lcore_id; > >+ > >+ return poll_stats->num_dequeue_pkts; > >+} > >diff --git a/lib/librte_power/rte_power_empty_poll.h > >b/lib/librte_power/rte_power_empty_poll.h > >new file mode 100644 > >index 0000000..ae27f7d > >--- /dev/null > >+++ b/lib/librte_power/rte_power_empty_poll.h > >@@ -0,0 +1,219 @@ > >+/* SPDX-License-Identifier: BSD-3-Clause > >+ * Copyright(c) 2010-2018 Intel Corporation > >+ */ > >+ > >+#ifndef _RTE_EMPTY_POLL_H > >+#define _RTE_EMPTY_POLL_H > >+ > >+/** > >+ * @file > >+ * RTE Power Management > >+ */ > >+#include > >+#include > >+ > >+#include > >+#include > >+#include > >+#include > >+#include > >+#include > >+ > >+#ifdef __cplusplus > >+extern "C" { > >+#endif > >+ > >+#define NUM_FREQS 20 > > I don't think this is enough. Suggest using RTE_MAX_LCORE_FREQS > agree > >+ > >+#define BINS_AV 4 /* Has to be ^2 */ > >+ > >+#define DROP (NUM_DIRECTIONS * NUM_DEVICES) > >+ > >+#define NUM_PRIORITIES 2 > >+ > >+#define NUM_NODES 256 /* Max core number*/ > >+ > >+/* Processor Power State */ > >+enum freq_val { > >+ LOW, > >+ MED, > >+ HGH, > >+ NUM_FREQ = NUM_FREQS > >+}; > >+ > > Why is NUM_FREQ in this enum? 0,1,2,20 (or RTE_MAX_LCORE_FREQS) does not > seem right. > If you are using NUM_FREQ in the code then why not just use NUM_FREQS. > the reason is to have some spare space in case we need extend to more stage. > >+ > >+/* Queue Polling State */ > >+enum queue_state { > >+ TRAINING, /* NO TRAFFIC */ > >+ MED_NORMAL, /* MED */ > >+ HGH_BUSY, /* HIGH */ > >+ LOW_PURGE, /* LOW */ > >+}; > >+ > >+/* Queue Stats */ > >+struct freq_threshold { > >+ > >+ uint64_t base_edpi; > >+ bool trained; > >+ uint32_t threshold_percent; > >+ uint32_t cur_train_iter; > >+}; > >+ > >+/* Each Worder Thread Empty Poll Stats */ > >+struct priority_worker { > >+ > >+ /* Current dequeue and throughput counts */ > >+ /* These 2 are written to by the worker threads */ > >+ /* So keep them on their own cache line */ > >+ uint64_t empty_dequeues; > >+ uint64_t num_dequeue_pkts; > >+ > >+ enum queue_state queue_state; > >+ > >+ uint64_t empty_dequeues_prev; > >+ uint64_t num_dequeue_pkts_prev; > >+ > >+ /* Used for training only */ > >+ struct freq_threshold thresh[NUM_FREQ]; > >+ enum freq_val cur_freq; > >+ > >+ /* bucket arrays to calculate the averages */ > >+ uint64_t edpi_av[BINS_AV]; > >+ uint32_t ec; > >+ uint64_t ppi_av[BINS_AV]; > >+ uint32_t pc; > >+ > >+ uint32_t lcore_id; > >+ uint32_t iter_counter; > >+ uint32_t threshold_ctr; > >+ uint32_t display_ctr; > >+ uint8_t dev_id; > >+ > >+} __rte_cache_aligned; > >+ > > Suggest adding a comment on each of the variables above explaining what > the acronym means. > E.g. edpi, ec, pc, ppi. > agree > > >+ > >+struct stats_data { > >+ > >+ struct priority_worker wrk_stats[NUM_NODES]; > >+ > >+ /* flag to stop rx threads processing packets until training over */ > >+ bool start_rx; > >+ > >+}; > >+ > >+/* Empty Poll Parameters */ > >+struct ep_params { > >+ > >+ /* Timer related stuff */ > >+ uint64_t interval_ticks; > >+ uint32_t max_train_iter; > >+ > >+ struct rte_timer timer0; > >+ struct stats_data wrk_data; > >+}; > >+ > >+ > >+/* Sample App Init information */ > >+struct ep_policy { > >+ > >+ uint64_t med_base_edpi; > >+ uint64_t hgh_base_edpi; > >+ > >+ enum queue_state state; > >+}; > >+ > >+ > >+ > >+/** > >+ * Initialize the power management system. > >+ * > >+ * @param eptr > >+ * the structure of empty poll configuration > >+ * @freq_tlb > >+ * the power state/frequency mapping table > >+ * @policy > >+ * the initialization policy from sample app > >+ * > >+ * @return > >+ * - 0 on success. > >+ * - Negative on error. > >+ */ > >+int __rte_experimental > >+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb, > >+ struct ep_policy *policy); > >+ > >+/** > >+ * Free the resource hold by power management system. > >+ */ > >+void __rte_experimental > >+rte_power_empty_poll_stat_free(void); > >+ > >+/** > >+ * Update specific core empty poll counter > >+ * It's not thread safe. > >+ * > >+ * @param lcore_id > >+ * lcore id > >+ * > >+ * @return > >+ * - 0 on success. > >+ * - Negative on error. > >+ */ > >+int __rte_experimental > >+rte_power_empty_poll_stat_update(unsigned int lcore_id); > >+ > >+/** > >+ * Update specific core valid poll counter, not thread safe. > >+ * > >+ * @param lcore_id > >+ * lcore id. > >+ * @param nb_pkt > >+ * The packet number of one valid poll. > >+ * > >+ * @return > >+ * - 0 on success. > >+ * - Negative on error. > >+ */ > >+int __rte_experimental > >+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt); > >+ > >+/** > >+ * Fetch specific core empty poll counter. > >+ * > >+ * @param lcore_id > >+ * lcore id > >+ * > >+ * @return > >+ * Current lcore empty poll counter value. > >+ */ > >+uint64_t __rte_experimental > >+rte_power_empty_poll_stat_fetch(unsigned int lcore_id); > >+ > >+/** > >+ * Fetch specific core valid poll counter. > >+ * > >+ * @param lcore_id > >+ * lcore id > >+ * > >+ * @return > >+ * Current lcore valid poll counter value. > >+ */ > >+uint64_t __rte_experimental > >+rte_power_poll_stat_fetch(unsigned int lcore_id); > >+ > >+/** > >+ * Empty poll state change detection function > >+ * > >+ * @param tim > >+ * The timer structure > >+ * @param arg > >+ * The customized parameter > >+ */ > >+void __rte_experimental > >+rte_empty_poll_detection(struct rte_timer *tim, void *arg); > >+ > >+#ifdef __cplusplus > >+} > >+#endif > >+ > >+#endif > >diff --git a/lib/librte_power/rte_power_version.map > >b/lib/librte_power/rte_power_version.map > >index dd587df..11ffdfb 100644 > >--- a/lib/librte_power/rte_power_version.map > >+++ b/lib/librte_power/rte_power_version.map > >@@ -33,3 +33,16 @@ DPDK_18.08 { > > rte_power_get_capabilities; > > > > } DPDK_17.11; > >+ > >+EXPERIMENTAL { > >+ global: > >+ > >+ rte_power_empty_poll_stat_init; > >+ rte_power_empty_poll_stat_free; > >+ rte_power_empty_poll_stat_update; > >+ rte_power_empty_poll_stat_fetch; > >+ rte_power_poll_stat_fetch; > >+ rte_power_poll_stat_update; > >+ rte_empty_poll_detection; > >+ > >+}; > > checkpatch has several warnings: > > > > ### lib/librte_power: traffic pattern aware power control > > WARNING:LONG_LINE: line over 80 characters > #355: FILE: lib/librte_power/rte_power_empty_poll.c:199: > + poll_stats->thresh[freq].base_edpi / 2000; > > WARNING:LONG_LINE: line over 80 characters > #417: FILE: lib/librte_power/rte_power_empty_poll.c:261: > + poll_stats->thresh[poll_stats->cur_freq].base_edpi); > > total: 0 errors, 2 warnings, 802 lines checked > ERROR: symbol rte_empty_poll_detection is added in a section other than > the EXPERIMENTAL section of the version map > ERROR: symbol rte_power_empty_poll_stat_fetch is added in a section > other than the EXPERIMENTAL section of the version map > ERROR: symbol rte_power_empty_poll_stat_free is added in a section other > than the EXPERIMENTAL section of the version map > ERROR: symbol rte_power_empty_poll_stat_init is added in a section other > than the EXPERIMENTAL section of the version map > ERROR: symbol rte_power_empty_poll_stat_update is added in a section > other than the EXPERIMENTAL section of the version map > ERROR: symbol rte_power_poll_stat_fetch is added in a section other than > the EXPERIMENTAL section of the version map > ERROR: symbol rte_power_poll_stat_update is added in a section other > than the EXPERIMENTAL section of the version map > Warning in /lib/librte_power/rte_power_empty_poll.c: > are you sure you want to add the following: > rte_panic\( > version map checking script(awk script to parse section name) has some issue here. > > Rgds, > Dave. > > > > > >